DocumentCode :
2893884
Title :
A Comprehensive Survey on Web Content Extraction Algorithms and Techniques
Author :
Al-Ghuribi, Sumaia Mohammed ; Alshomrani, Saleh
Author_Institution :
Fac. of Comput. & Inf. Technol., King Abdulaziz Univ., Jeddah, Saudi Arabia
fYear :
2013
fDate :
24-26 June 2013
Firstpage :
1
Lastpage :
5
Abstract :
Web Content Extraction is an important problem that has been studied through a variety of approaches and algorithms. It is concerned with extracting meaningful and useful data from a Webpage that is surrounded by noisy data such as advertisements and navigation links. Many applications benefit from the extracted content, including crawlers, indexers, document classification, and information retrieval. This survey aims to provide a comprehensive overview of the many approaches that have been constructed for extracting Webpage content. In this survey, Web Content Extraction approaches are classified into categories, and for each category some approaches are described in detail along with their weaknesses. Based on an in-depth analysis of these approaches, we draw out the fundamental factors for constructing an optimal Web content extractor.
Keywords :
Web sites; content management; data mining; pattern classification; Web crawlers; Webpage content extraction algorithm; Webpage content extraction technique; Webpage data extraction; advertisement links; document classification; indexers; information retrieval; navigation links; noisy data; optimal Web content extractor; Algorithm design and analysis; Classification algorithms; Data mining; Feature extraction; HTML; Visualization; Web sites;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Information Science and Applications (ICISA), 2013 International Conference on
Conference_Location :
Suwon
Print_ISBN :
978-1-4799-0602-4
Type :
conf
DOI :
10.1109/ICISA.2013.6579445
Filename :
6579445