• DocumentCode
    2118129
  • Title
    Automatic Extraction of Blog Post from Diverse Blog Pages
  • Author
    Chia-Hui Chang; Jhih-Ming Chen
  • Author_Institution
    Dept. of Comput. Sci. & Inf. Eng., Nat. Central Univ., Chungli, Taiwan
  • Volume
    1
  • fYear
    2012
  • fDate
    4-7 Dec. 2012
  • Firstpage
    129
  • Lastpage
    136
  • Abstract
    Blog post extraction is essential for research on the blogosphere. In this paper, we address the problem of extracting blog posts from diverse blog pages, aiming to locate each blog post automatically and precisely. Most previous research focused on extracting the main content from news pages, but the problem becomes more complex for blog pages. Our approach combines maximum scoring subsequence [11] with text-to-tag ratio [18] to develop algorithms suited to blog pages. The first method we propose is PTR Scoring, which combines post-to-tag ratio with maximum scoring subsequence. The second method is CRF Scoring, which applies a Conditional Random Field to train a sequence labeling model and uses maximum scoring subsequence to improve extraction accuracy. The experimental results show that CRF Scoring achieves the best F-Measure, 91.9%, among the compared methods.
  • Keywords
    Web sites; information retrieval; CRF scoring; F-Measure; PTR scoring; automatic blog post extraction; blog pages; blogosphere; conditional random field; main content extraction; maximum scoring subsequence combination; post-to-tag ratio; sequence labeling model; text-to-tag ratio; blog post extraction; maximum sequence; sequence labeling
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Web Intelligence and Intelligent Agent Technology (WI-IAT), 2012 IEEE/WIC/ACM International Conferences on
  • Conference_Location
    Macau
  • Print_ISBN
    978-1-4673-6057-9
  • Type
    conf
  • DOI
    10.1109/WI-IAT.2012.25
  • Filename
    6511875
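
The abstract describes combining a text-to-tag ratio with a maximum scoring subsequence. The record gives no implementation details, so the following is only a minimal sketch of that general idea, not the paper's PTR Scoring or CRF Scoring. Everything in it is an assumption made for illustration: the per-line scoring, the regular expression used to count tags, the `threshold` value, and the function names `text_to_tag_ratio`, `max_scoring_subsequence`, and `extract_main_block`.

```python
import re

# Assumption: a tag is anything between '<' and '>' on a single line.
TAG_RE = re.compile(r"<[^>]*>")


def text_to_tag_ratio(line: str) -> float:
    """Non-tag characters divided by the tag count (tag count floored at 1)."""
    tags = TAG_RE.findall(line)
    text = TAG_RE.sub("", line).strip()
    return len(text) / max(len(tags), 1)


def max_scoring_subsequence(scores):
    """Return (start, end, total) for the contiguous run with the highest sum.

    `end` is exclusive. If no run has a positive sum, returns (0, 0, 0.0).
    """
    best_start, best_end, best_sum = 0, 0, 0.0
    cur_start, cur_sum = 0, 0.0
    for i, score in enumerate(scores):
        if cur_sum <= 0:
            # A non-positive prefix can never help; restart the run here.
            cur_start, cur_sum = i, score
        else:
            cur_sum += score
        if cur_sum > best_sum:
            best_start, best_end, best_sum = cur_start, i + 1, cur_sum
    return best_start, best_end, best_sum


def extract_main_block(html_lines, threshold=10.0):
    """Pick the contiguous lines whose (ratio - threshold) scores sum highest.

    Subtracting the (hypothetical) threshold makes tag-heavy lines score
    negative, so navigation and footer lines tend to fall outside the run.
    """
    scores = [text_to_tag_ratio(line) - threshold for line in html_lines]
    start, end, _ = max_scoring_subsequence(scores)
    return html_lines[start:end]
```

Called on a list of HTML lines, `extract_main_block` returns the contiguous slice it judges most text-dense; link-heavy lines such as menus score negative and are excluded unless they sit between two strongly positive regions.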