• DocumentCode
    711537
  • Title

    Attribute based content mining for regional web documents

  • Author

    Prakash, Kolla Bhanu ; Dorai Rangaswamy, M.A. ; Raman, Arun Raja

  • Author_Institution
    Sathyabama Univ., Chennai, India
  • fYear
    2013
  • fDate
    12-14 Dec. 2013
  • Firstpage
    368
  • Lastpage
    373
  • Abstract
    The rapid expansion of the Internet has made the WWW a popular place for disseminating and collecting information. Extracting useful information from Web pages thus becomes an important task. Apart from the main content blocks, Web pages usually contain blocks such as navigation bars, copyright and privacy notices, relevant hyperlinks, and advertisements, which are called noisy blocks. Although such information items are functionally useful for human viewers and necessary for the Web site owners, they often hamper Web page clustering, classification, information retrieval and information extraction. Today, people use the Web for a large variety of activities including travel planning, comparison shopping, entertainment, and research. However, the tools available for collecting, organizing, and sharing Web content have not kept pace with the rapid growth in information. The major complexity arises when Web documents or information are in regional languages. Extracting the content of a document and then communicating it through oral or textual means is quite involved, as both syntax and semantics are needed. Depending on the form and structure of the Web document, this task becomes difficult. This is the area the current paper addresses, through a novel approach based on pixel maps that shows how content can be extracted and knowledge created in the mind of an illiterate user. The paper first presents how letters and words, which form the basis of text-based communication, can be used for content. The objective of this task is to achieve a concept-based term analysis at the sentence and document levels rather than a single-term analysis in the document set only. This paper outlines the use of attributes for content extraction, using basic pixel attributes and pattern matching, a statistical model and pattern matching, and artificial neural network training.
  • Keywords
    Internet; data mining; document handling; WWW; Web content; Web page clustering; Web site owners; advertisements; artificial neural network training; attribute based content mining; basic pixel attributes; comparison shopping; content extraction; copyright; entertainment; information extraction; information retrieval; navigation bars; pattern matching; pixel maps; privacy notices; regional Web documents; regional languages; relevant hyperlinks; research; statistical model; travel planning; ANN; Media Mining; Multi-Lingual; Statistical Interpretation
  • fLanguage
    English
  • Publisher
    IET
  • Conference_Titel
    Sustainable Energy and Intelligent Systems (SEISCON 2013), IET Chennai Fourth International Conference on
  • Conference_Location
    Chennai
  • Print_ISBN
    978-1-78561-030-1
  • Type

    conf

  • DOI
    10.1049/ic.2013.0340
  • Filename
    7119727