DocumentCode :
684851
Title :
A new algorithm: Extracting text information from Webpage based on block and tag-function
Author :
Dingrong Yuan ; Xiaohu Yang ; Xue Nong ; Huiwen Fu
Author_Institution :
Coll. of Comput. Sci. & Inf. Technol., Guangxi Normal Univ., Guilin, China
fYear :
2012
fDate :
7-9 Dec. 2012
Firstpage :
1
Lastpage :
4
Abstract :
A Webpage contains a great deal of information that users need, but it is also filled with noise. How to remove this noise and extract the useful text information has become one of the hottest topics in the field of Web data mining. This paper proposes a text information extraction algorithm based on visual information and tag-function. The algorithm first divides a webpage into blocks, and then extracts text information from these blocks using rules derived from the characteristics of tag-function. Experiments show that the algorithm is effective and efficient.
Keywords :
Web sites; data mining; text analysis; Web data mining; Webpage; block-function; tag-function; text information extraction algorithm; DOM tree; information extraction; tag-function; text information; visual block;
fLanguage :
English
Publisher :
IET
Conference_Titel :
Information Science and Control Engineering 2012 (ICISCE 2012), IET International Conference on
Conference_Location :
Shenzhen
Electronic_ISBN :
978-1-84919-641-3
Type :
conf
DOI :
10.1049/cp.2012.2437
Filename :
6755816