Attribute based content mining for regional web documents

Author

Prakash, Kolla Bhanu ; Dorai Rangaswamy, M.A. ; Raman, Arun Raja

Author_Institution

Sathyabama Univ., Chennai, India

fYear

2013

fDate

12-14 Dec. 2013

Firstpage

368

Lastpage

373

Abstract

The rapid expansion of the Internet has made the WWW a popular place for disseminating and collecting information, so extracting useful information from Web pages has become an important task. Apart from the main content blocks, Web pages usually contain navigation bars, copyright and privacy notices, relevant hyperlinks, and advertisements, which are called noisy blocks. Although these items are functionally useful for human viewers and necessary for Web site owners, they often hamper Web page clustering, classification, information retrieval, and information extraction. Today, people use the Web for a large variety of activities, including travel planning, comparison shopping, entertainment, and research. However, the tools available for collecting, organizing, and sharing Web content have not kept pace with the rapid growth in information. The complexity increases further when Web documents are in regional languages. Extracting the content of such a document and then communicating it through oral or textual means is quite involved, because both syntax and semantics are needed. The difficulty of the task depends on the form and structure of the Web document. This paper addresses the problem through a novel approach based on pixel maps, showing how content can be extracted from them and how knowledge can be conveyed to illiterate users. The paper first presents how letters and words, which form the basis of text-based communication, can be used to identify content. The objective is to achieve a concept-based term analysis at the sentence and document levels, rather than a single-term analysis over the document set only. The paper outlines the use of attributes for content extraction through three approaches: basic pixel attributes with pattern matching, a statistical model with pattern matching, and Artificial Neural Network training.

Keywords

Internet; data mining; document handling; WWW; Web content; Web page clustering; Web site owners; advertisements; artificial neural network training; attribute based content mining; basic pixel attributes; comparison shopping; content extraction; copyright; entertainment; information extraction; information retrieval; navigation bars; pattern matching; pixel maps; privacy notices; regional Web documents; regional languages; relevant hyperlinks; research; statistical model; travel planning; ANN; Media Mining; Multi-Lingual; Statistical Interpretation

fLanguage

English

Publisher

iet

Conference_Titel

Sustainable Energy and Intelligent Systems (SEISCON 2013), IET Chennai Fourth International Conference on

Conference_Location

Chennai

Print_ISBN

978-1-78561-030-1

Type

conf

DOI

10.1049/ic.2013.0340

Filename

7119727