Title :
A webpage information extraction method based on game theory
Author :
Bohai Yu;Zhang Xia;Zhengyou Xia
Author_Institution :
College of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics, China
fDate :
7/1/2015 12:00:00 AM
Abstract :
With the development of Web 2.0, many websites, news websites in particular, publish information through their own CMS (content management system). How to extract information from such heterogeneous webpages has become an increasingly popular research topic, and many methods have been proposed that extract the valid content adaptively. In this paper we propose a method based on game theory to efficiently extract the main text from a webpage, finding the target label by means of a label game. Our method consists of two steps: (a) filtering out the script and style tags in the webpage, and then dividing the entire HTML page into blocks using the div tag; (b) extracting features from the blocks and finding the Nash equilibrium of the game matrix. Experiments on several websites indicate that our game-theory-based model is valid and yields better results.
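The two steps in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not say which features are used or how the game matrix is built, so `block_features`, its two features, and the toy payoff matrices in the usage note are hypothetical. Only the overall shape (drop script and style content, split by div, search a matrix for a Nash equilibrium) comes from the text, and the equilibrium search here is restricted to pure strategies as a simplifying assumption.

```python
from html.parser import HTMLParser


class DivBlockParser(HTMLParser):
    """Collect the text of each <div>, skipping <script> and <style> content."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0   # > 0 while inside <script> or <style>
        self.stack = []       # one text buffer per currently open <div>
        self.blocks = []      # text of every closed, non-empty <div>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self.skip_depth += 1
        elif tag == "div":
            self.stack.append([])

    def handle_endtag(self, tag):
        if tag in ("script", "style"):
            self.skip_depth = max(0, self.skip_depth - 1)
        elif tag == "div" and self.stack:
            # Normalise whitespace and keep the block only if it has text.
            text = " ".join(" ".join(self.stack.pop()).split())
            if text:
                self.blocks.append(text)

    def handle_data(self, data):
        # Text belongs to the innermost open <div>; text outside any <div>
        # and text inside <script>/<style> is ignored.
        if self.skip_depth == 0 and self.stack:
            self.stack[-1].append(data)


def split_into_div_blocks(html):
    """Step (a): filter script/style content and split the page by <div>."""
    parser = DivBlockParser()
    parser.feed(html)
    parser.close()
    return parser.blocks


def block_features(text):
    """Hypothetical features; the abstract does not name the real ones."""
    return {
        "length": len(text),
        "punctuation": sum(text.count(ch) for ch in ",.;:!?"),
    }


def pure_nash_equilibria(payoff_a, payoff_b):
    """Step (b), simplified: all pure-strategy Nash equilibria (i, j).

    payoff_a[i][j] and payoff_b[i][j] are the payoffs of the row and the
    column player when the row player picks i and the column player picks j.
    """
    rows, cols = len(payoff_a), len(payoff_a[0])
    equilibria = []
    for i in range(rows):
        for j in range(cols):
            row_is_best = all(payoff_a[i][j] >= payoff_a[k][j] for k in range(rows))
            col_is_best = all(payoff_b[i][j] >= payoff_b[i][m] for m in range(cols))
            if row_is_best and col_is_best:
                equilibria.append((i, j))
    return equilibria
```

As a usage example under the same assumptions, `split_into_div_blocks(page)` returns one string per non-empty div, `block_features` turns each string into a small feature dictionary, and `pure_nash_equilibria([[3, 0], [5, 1]], [[3, 5], [0, 1]])` scans a toy 2x2 game for cells where neither player can gain by deviating alone. How the features would populate the game matrix is left open here, because the abstract does not specify it.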
Conference_Title :
Smart and Sustainable City and Big Data (ICSSC), 2015 International Conference on
DOI :
10.1049/cp.2015.0252