Author :
Prakash, Kolla Bhanu ; Dorai Rangaswamy, M.A. ; Raman, Arun Raja
Abstract :
The rapid expansion of the Internet has made the WWW a popular place for disseminating and collecting information. Extracting useful information from Web pages thus becomes an important task. Apart from the main content blocks, Web pages usually contain blocks such as navigation bars, copyright and privacy notices, relevant hyperlinks, and advertisements, which are called noisy blocks. Although such information items are functionally useful for human viewers and necessary for Web site owners, they often hamper Web page clustering, classification, information retrieval, and information extraction. Today, people use the Web for a large variety of activities, including travel planning, comparison shopping, entertainment, and research. However, the tools available for collecting, organizing, and sharing Web content have not kept pace with the rapid growth in information. The major complexity arises when Web documents or information are in regional languages. Extracting the content of such a document and then communicating it through oral or textual means is quite involved, as both syntax and semantics are needed. Depending on the form and structure of the Web document, this task becomes difficult. This is the area the paper addresses, through a novel approach based on pixel maps, showing how content can be extracted and how knowledge can be created in the mind of an illiterate user. The paper first presents how letters and words, which form the basis of text-based communication, can be used for content extraction. The objective is to achieve concept-based term analysis at the sentence and document levels, rather than single-term analysis over the document set only. The paper outlines the use of attributes for content extraction, using basic pixel attributes with pattern matching, a statistical model with pattern matching, and artificial neural network training.
Keywords :
Internet; data mining; document handling; WWW; Web content; Web page clustering; Web site owners; advertisements; artificial neural network training; attribute based content mining; basic pixel attributes; comparison shopping; content extraction; copyright; entertainment; information extraction; information retrieval; navigation bars; pattern matching; pixel maps; privacy notices; regional Web documents; regional languages; relevant hyperlinks; research; statistical model; travel planning; ANN; Media Mining; Multi-Lingual; Statistical Interpretation