Title :
Content Extraction from Chinese Web Pages Based on Punctuations Distribution
Author :
Peng, Qian ; Wang, Qinglin ; Li, Yuan ; Zhang, Jixian ; Hao, Yuexing
Author_Institution :
Sch. of Autom., Beijing Inst. of Technol., Beijing, China
Abstract :
Content extraction from web pages is an important technique for obtaining information resources from the Internet. This paper proposes an effective and general approach to extracting content from an HTML page by taking advantage of the distribution of Chinese punctuation marks. First, the distribution of Chinese punctuation marks in the HTML source is computed to find a position that lies inside the web page content. Then, starting from that position, the left and right boundaries of the content are computed. Finally, the content between the left and right boundaries is extracted. Experimental results show that the accuracy of the algorithm exceeds 98%.
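The three steps in the abstract can be illustrated with a minimal Python sketch. It is not the authors' algorithm, whose details the abstract does not give. It assumes that the seed position is the punctuation mark that starts the densest fixed-size window, and that a boundary stops growing when the gap to the next mark exceeds a threshold. The names extract_content, window, and max_gap, and the contents of the punctuation set, are hypothetical.

```python
import re
from bisect import bisect_right

# Assumed set of full-width Chinese punctuation marks (not taken from the paper).
CN_PUNCT = frozenset(",。、;:?!“”‘’()《》")


def extract_content(html, window=200, max_gap=300):
    """Return the text of the region richest in Chinese punctuation.

    window and max_gap are hypothetical tuning parameters, in characters.
    """
    # Offsets of every Chinese punctuation mark in the raw HTML source.
    pos = [i for i, ch in enumerate(html) if ch in CN_PUNCT]
    if not pos:
        return ""

    # Step 1: pick a seed inside the content. Here the seed is the mark
    # whose following `window` characters contain the most marks.
    def density(k):
        return bisect_right(pos, pos[k] + window) - k

    seed = max(range(len(pos)), key=density)

    # Step 2: grow the left and right boundaries from the seed while
    # neighboring marks stay within max_gap characters of each other.
    left = seed
    while left > 0 and pos[left] - pos[left - 1] <= max_gap:
        left -= 1
    right = seed
    while right + 1 < len(pos) and pos[right + 1] - pos[right] <= max_gap:
        right += 1

    # Step 3: widen to the enclosing text run, cut the span, strip tags.
    start = html.rfind(">", 0, pos[left]) + 1
    end = html.find("<", pos[right])
    if end == -1:
        end = len(html)
    return re.sub(r"<[^>]+>", "", html[start:end]).strip()
```

Navigation bars and footers usually carry few sentence-level punctuation marks, which is why they tend to fall outside the boundaries in this sketch. Suitable values for window and max_gap would have to be tuned on real pages.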
Keywords :
Internet; hypermedia markup languages; information resources; information retrieval; natural language processing; Chinese punctuation distribution; Chinese web pages; HTML page; content extraction; left boundary computation; right boundary computation; Accuracy; Data mining; Feature extraction; HTML; Kernel; Navigation; Web pages; kernel punctuation; punctuation distribution
Conference_Title :
Computer Science & Service System (CSSS), 2012 International Conference on
Conference_Location :
Nanjing
Print_ISBN :
978-1-4673-0721-5
DOI :
10.1109/CSSS.2012.341