Title of article

Main Content Extraction from Detailed Web Pages

Author/Authors

Mohsen Asfia (author), Mir Mohsen Pedram (author), Amir Masoud Rahmani (author)

Issue Information

Periodical with serial number, year 2010

Pages

4

From page

18

To page

21

Abstract

Detailed web pages on the internet contain information that is not considered primary content, such as advertisements, headers, footers, navigation links, and copyright information. Search engines also prefer not to index some information on web pages, such as comments and reviews, as informative content, so an algorithm that extracts only the main content could improve the quality of web page indexing. Almost all algorithms proposed so far are tag dependent, meaning they can only look for primary content among specific tags. The algorithm in this paper simulates a user's visit to a web page and how the user finds the position of the main content block in the page. The proposed method is tag independent and has two phases. First, it transforms the DOM tree obtained from the input HTML detailed web page into a block tree, based on visual representation and DOM structure, such that every node carries a specification vector. It then traverses the resulting small block tree to find the main block, whose value computed from its specification vector dominates those of the other block nodes. The method has no learning phase and can find informative content on any arbitrary input detailed web page. It has been tested on a large variety of websites and, as the paper shows, achieves better precision and recall than the compared method, K-FE.
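
The two phases described in the abstract can be illustrated with a minimal sketch. The abstract does not say what the specification vector contains or how the dominant value is computed, so everything below is an assumption made for illustration: the vector holds only text length and link-text length, the score is the count of non-link text characters, and the `dominance` threshold of 0.7 is arbitrary. The visual-representation cues the paper mentions are not modeled. The names `Block`, `BlockTreeBuilder`, `build_block_tree`, and `find_main_block` are hypothetical.

```python
from dataclasses import dataclass, field
from html.parser import HTMLParser

# Elements with no closing tag; they never become blocks in this sketch.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}
# Elements whose text is never shown to the user.
SKIP_TAGS = {"script", "style"}


@dataclass
class Block:
    tag: str
    attrs: dict = field(default_factory=dict)
    children: list = field(default_factory=list)
    text_len: int = 0   # assumed vector entry: text characters in subtree
    link_len: int = 0   # assumed vector entry: those inside <a> elements

    def score(self) -> int:
        # Assumed scoring rule: count of non-link text characters.
        return self.text_len - self.link_len


class BlockTreeBuilder(HTMLParser):
    """Phase 1 (sketch): turn the element tree into a tree of Blocks."""

    def __init__(self):
        super().__init__()
        self.root = Block("root")
        self.stack = [self.root]
        self.link_depth = 0
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if tag == "a":
            self.link_depth += 1
        if tag in SKIP_TAGS:
            self.skip_depth += 1
        node = Block(tag, dict(attrs))
        self.stack[-1].children.append(node)
        self.stack.append(node)

    def handle_endtag(self, tag):
        # Close the nearest open element with this name, if there is one.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tag == tag:
                for popped in self.stack[i:]:
                    if popped.tag == "a":
                        self.link_depth -= 1
                    if popped.tag in SKIP_TAGS:
                        self.skip_depth -= 1
                del self.stack[i:]
                break

    def handle_data(self, data):
        n = len(data.strip())
        if n == 0 or self.skip_depth:
            return
        # Every open ancestor accumulates the text of its subtree.
        for node in self.stack:
            node.text_len += n
            if self.link_depth:
                node.link_len += n


def build_block_tree(html: str) -> Block:
    parser = BlockTreeBuilder()
    parser.feed(html)
    parser.close()
    return parser.root


def find_main_block(root: Block, dominance: float = 0.7) -> Block:
    """Phase 2 (sketch): descend while one child dominates its parent."""
    node = root
    while node.children:
        best = max(node.children, key=Block.score)
        if node.score() == 0 or best.score() < dominance * node.score():
            break
        node = best
    return node
```

Calling `find_main_block(build_block_tree(html))` returns the deepest block that still holds most of the page's non-link text. No tag name is treated as special apart from `<a>`, which is used only to discount navigation text.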

Keywords

noise elimination , WEB MINING , Informative content , Information retrieval , Information extraction

Journal title

International Journal of Computer Applications

Serial Year

2010

Record number

659926

Link To Document

https://search.isc.ac/dl/search/defaultta.aspx?DTC=10&DC=659926