• DocumentCode
    2316829
  • Title
    Improving Webpage Content Extraction by extending a novel single page extraction approach: A case study with Thai websites
  • Author
    Thanadechteemapat, Wigrai ; Fung, Chun Che
  • Author_Institution
    Sch. of Inf. Technol., Murdoch Univ., Perth, WA, Australia
  • Volume
    4
  • fYear
    2012
  • fDate
    15-17 July 2012
  • Firstpage
    1263
  • Lastpage
    1267
  • Abstract
    A Web Content Extraction technique is proposed in this paper. The technique uses heuristic rules and works with both single and multiple pages. For multiple page extraction, an Extracted Content Matching (ECM) technique is proposed to identify noise among the extracted results. Several features are also introduced to reduce processing time, namely the use of XPath, file compression, and parallel processing. Performance is assessed by precision, recall and F-measure, computed from the length of the extracted content. Initial results, obtained by comparing the output of the proposed approach with manual extraction, are good.
  • Keywords
    Internet; Web sites; ECM technique; F-measure; Thai Web sites; Web page content extraction improvement; XPath; extracted content matching technique; file compression; heuristic rules; manual process; multiple page extraction; parallel processing; precision; recall; single page extraction approach; Abstracts; Channel hot electron injection; Integrated optics; Visualization; Extracted Content Matching (ECM); Web Content Extraction
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Machine Learning and Cybernetics (ICMLC), 2012 International Conference on
  • Conference_Location
    Xi'an
  • ISSN
    2160-133X
  • Print_ISBN
    978-1-4673-1484-8
  • Type
    conf
  • DOI
    10.1109/ICMLC.2012.6359546
  • Filename
    6359546
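
The abstract states that performance is assessed by precision, recall and F-measure using the length of the extracted content, but the record does not give the formulas. The following is a minimal Python sketch of one way such length-based scores could be computed. It is not the authors' implementation. It assumes that length is counted in characters and that the overlap between extracted and manually extracted text is the total size of the matching blocks found by `difflib.SequenceMatcher`. The function names `overlap_length` and `length_based_scores` are hypothetical.

```python
from difflib import SequenceMatcher


def overlap_length(extracted: str, reference: str) -> int:
    """Count characters shared by both strings (assumed overlap measure)."""
    matcher = SequenceMatcher(None, extracted, reference, autojunk=False)
    # get_matching_blocks() ends with a zero-size sentinel, which adds nothing.
    return sum(block.size for block in matcher.get_matching_blocks())


def length_based_scores(extracted: str, reference: str) -> tuple[float, float, float]:
    """Return (precision, recall, f_measure) based on character lengths.

    precision = overlap / length of extracted content
    recall    = overlap / length of manually extracted (reference) content
    f_measure = harmonic mean of precision and recall
    """
    if not extracted or not reference:
        return 0.0, 0.0, 0.0
    overlap = overlap_length(extracted, reference)
    precision = overlap / len(extracted)
    recall = overlap / len(reference)
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


if __name__ == "__main__":
    # Hypothetical example: the extractor kept the full content plus some noise.
    p, r, f = length_based_scores("Hello world noise", "Hello world")
    print(f"precision={p:.3f} recall={r:.3f} f_measure={f:.3f}")
```

Under these assumptions, extracted noise lowers precision while leaving recall unchanged, and missing content lowers recall while leaving precision unchanged.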