DocumentCode :
2023915
Title :
A Fast and Accurate Approach for Main Content Extraction Based on Character Encoding
Author :
Mohammadzadeh, Hadi ; Gottron, T. ; Schweiggert, Franz ; Nakhaeizadeh, Gholamreza
Author_Institution :
Inst. of Appl. Inf. Process., Univ. of Ulm, Ulm, Germany
fYear :
2011
fDate :
Aug. 29 2011-Sept. 2 2011
Firstpage :
167
Lastpage :
171
Abstract :
This paper presents a novel approach for extracting the main content from Web documents written in languages that are not based on the Latin alphabet. HTML tags are written in English, and the English character set is encoded in the interval [0,127] of the Unicode character set, whereas many other languages, such as Arabic, use a different interval for their characters. In the first phase of our approach, we exploit this distinction to quickly separate the non-ASCII characters from the ASCII characters. We then determine the areas of the HTML file that have a high density of non-ASCII characters and a low density of ASCII characters, and use this density to identify the areas that contain the main content. Finally, we feed those areas to our parser in order to extract the main content of the Web page. The proposed algorithm, called DANA, outperforms alternative approaches in terms of both efficiency and effectiveness, and has the potential to be extended to languages based on ASCII characters.
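The density idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' DANA implementation: the function names `count_chars` and `content_line_ranges`, the line-by-line granularity, and the `min_diff` threshold are all assumptions made for illustration, and the final parsing step mentioned in the abstract is omitted.

```python
def count_chars(line):
    """Return (ascii_count, non_ascii_count) for one line of HTML source.

    A character counts as ASCII when its code point lies in [0, 127].
    """
    non_ascii = sum(1 for ch in line if ord(ch) > 127)
    return len(line) - non_ascii, non_ascii


def content_line_ranges(html, min_diff=0):
    """Return inclusive (start, end) line-index ranges dominated by non-ASCII text.

    Assumption: a line is treated as "dense" when its non-ASCII count exceeds
    its ASCII count by more than min_diff. Consecutive dense lines are merged
    into one candidate area.
    """
    lines = html.splitlines()
    ranges = []
    start = None
    for i, line in enumerate(lines):
        ascii_n, non_ascii_n = count_chars(line)
        dense = non_ascii_n - ascii_n > min_diff
        if dense and start is None:
            start = i                      # a new candidate area begins
        elif not dense and start is not None:
            ranges.append((start, i - 1))  # the current area has ended
            start = None
    if start is not None:                  # area runs to the end of the file
        ranges.append((start, len(lines) - 1))
    return ranges
```

Called on an HTML page whose body text is Arabic, `content_line_ranges` would be expected to return the line ranges of the text paragraphs and skip lines that consist mostly of markup, since tags and attributes are pure ASCII.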
Keywords :
Internet; hypermedia markup languages; information retrieval; natural languages; Arabic language; DANA; English character set; English language; HTML tags; Web documents; Web page; character encoding; main content extraction; non-ASCII; Unicode character set; Electronic publishing; Encoding; Encyclopedias; HTML; Internet; Web pages; ASCII and Non-ASCII character set; HTML Documents; Information Retrieval; Main Content Extraction; UTF-8
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Database and Expert Systems Applications (DEXA), 2011 22nd International Workshop on
Conference_Location :
Toulouse
ISSN :
1529-4188
Print_ISBN :
978-1-4577-0982-1
Type :
conf
DOI :
10.1109/DEXA.2011.2
Filename :
6059811