Regional Information Center for Science and Technology - Automatic Web Content Extraction for Generating Tag Clouds from Thai Web Sites

DocumentCode :

2666937

Title :

Automatic Web Content Extraction for Generating Tag Clouds from Thai Web Sites

Author :

Thanadechteemapat, Wigrai ; Fung, Chun Che

Author_Institution :

Sch. of Inf. Technol., Murdoch Univ., Murdoch, WA, Australia

fYear :

2011

fDate :

19-21 Oct. 2011

Firstpage :

Lastpage :

Abstract :

This paper proposes a novel Web content extraction approach based on heuristic rules and the XPath utility in XML. The main objective is to address the problem of Web visualization by generating tag clouds from Thai Web sites, providing an overview of the key words in the Web pages. The paper also proposes a detailed method for assessing the Web content extraction technique on a single Web page, using the length of the extracted content. The proposed technique has three main steps: Web page element and feature extraction, block detection, and content extraction selection. Empirical results show that the technique achieves high accuracy.

Keywords :

Web sites; XML; cloud computing; information retrieval; Thai Web sites; Web visualization; XPath utility; automatic Web content extraction; block detection; content extraction selection; features extraction; heuristic rules; tag clouds; Accuracy; Feature extraction; Noise; Visualization; Web pages; Web Content Extraction; XPath

fLanguage :

English

Publisher :

IEEE

Conference_Title :

e-Business Engineering (ICEBE), 2011 IEEE 8th International Conference on

Conference_Location :

Beijing

Print_ISBN :

978-1-4577-1404-7

Type :

conf

DOI :

10.1109/ICEBE.2011.34

Filename :

6104601

Link To Document :

https://search.ricest.ac.ir/dl/search/defaultta.aspx?DTC=49&DC=2666937