Title :
A text block context informations based multiple Web contents extraction
Author :
Wonmoon Song;Myungwon Kim
Author_Institution :
Strategic Business Team, ONYCOM, Seoul, Republic of Korea
Abstract :
In a Web environment, providing services that match users' needs requires quickly and accurately extracting contents such as main-content, menu-list, article-list, and comments from Web documents. In this paper, we propose an efficient method that extracts multiple content types from Web documents. The method separates a document into text blocks, extracts context information for each block, and uses that information to classify the block's content type. Context information consists of documenting patterns and structural features of a Web document. For documenting patterns, we use in/out link information, which extends the word/link density proposed in previous work. For structural features, we use the distances between text blocks and the parent tags of the target text block. We evaluated our method on a published data set and on a data set that we collected. The experimental results show that, compared to existing methods, our method achieves about 17 percentage points higher accuracy for multiple-content extraction and about 14 percentage points higher F-measure for main-content extraction.
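The documenting-pattern features mentioned in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not define "in/out link information", so the sketch assumes that an "in" link points to the same host as the page and an "out" link points anywhere else. The names `BLOCK_TAGS`, `BlockFeatureParser`, and `block_features` are hypothetical, and the set of block-level tags is an arbitrary choice made for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

# Assumption: tags that start or end a text block (chosen for illustration only).
BLOCK_TAGS = {"p", "div", "li", "td", "tr", "table", "ul", "ol", "h1", "h2", "h3"}


class BlockFeatureParser(HTMLParser):
    """Splits HTML into text blocks and counts plain and linked words per block."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.base_host = urlparse(base_url).netloc
        self.blocks = []
        self.link_kind = None  # None, "in", or "out" while inside an <a> element
        self._new_block()

    def _new_block(self):
        self.current = {"words": 0, "in_link_words": 0, "out_link_words": 0}

    def flush(self):
        # Keep only blocks that actually contain text.
        if self.current["words"] > 0:
            self.blocks.append(self.current)
        self._new_block()

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self.flush()
        elif tag == "a":
            href = dict(attrs).get("href") or ""
            host = urlparse(urljoin(self.base_url, href)).netloc
            # Assumed interpretation: same host means "in" link, otherwise "out".
            self.link_kind = "in" if host == self.base_host else "out"

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS:
            self.flush()
        elif tag == "a":
            self.link_kind = None

    def handle_data(self, data):
        n = len(data.split())
        self.current["words"] += n
        if self.link_kind:
            self.current[self.link_kind + "_link_words"] += n


def block_features(html, base_url):
    """Returns one feature dict per non-empty text block, including link density."""
    parser = BlockFeatureParser(base_url)
    parser.feed(html)
    parser.close()
    parser.flush()
    for block in parser.blocks:
        linked = block["in_link_words"] + block["out_link_words"]
        block["link_density"] = linked / block["words"]
    return parser.blocks
```

Under these assumptions, a navigation block made up almost entirely of anchors gets a link density near 1, while a paragraph of article text with a single link gets a low value. A classifier could combine such values with structural features like parent tags and distances between blocks, which this sketch does not compute.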
Keywords :
"Feature extraction","HTML","Context","Visualization","Data mining","Standards","XML"
Conference_Title :
2015 IEEE International Conference on Data Science and Advanced Analytics (DSAA)
Print_ISBN :
978-1-4673-8272-4
DOI :
10.1109/DSAA.2015.7344829