• DocumentCode
    3401909
  • Title

    Web document information extraction using class attribute approach

  • Author

    Srivastava, Sanjeev ; Haroon, Mohd ; Bajaj, Anu

  • Author_Institution
    CSE Deptt. IET, Dr. R.M.L. Avadh Univ., Faizabad, India
  • fYear
    2013
  • fDate
    20-22 Sept. 2013
  • Firstpage
    17
  • Lastpage
    22
  • Abstract
    Change is the nature of things, and in information technology change happens rapidly. As new technologies keep changing how information is represented, finding the relevant pieces of information on a web page becomes difficult because of heavy noise and clutter from distracting features (such as advertisements, links, and scrollers). Extracting information or useful content from web pages (structured or semi-structured) has therefore become a critical issue for web users and web miners. Noise on a web page can mislead the user, so information extraction from web pages is of great importance. A central puzzle in information extraction is how to define the noise domain and remove it. Recent studies have made techniques such as wrapper induction, feature extractors, the backpropagation algorithm for neural networks, content extractors, and PAT trees well known. In this paper we investigate DOM tree segmentation with a class-attribute-based approach. The class attribute can be used with all HTML elements inside the 'BODY' section of the document. It is used to create different classes of an element, where each class can have its own properties. To evaluate system performance, several experiments were done on different commercial, news, and entertainment websites. The experiments indicate that our method is applicable to extracting informative content from the web pages of these websites.
  • Keywords
    Web sites; data mining; hypermedia markup languages; information retrieval; tree data structures; DOM tree segmentation; HTML elements; PAT trees; Web document information extraction; Web miners; Web page content extraction; Web users; backpropagation algorithm; class attribute approach; commercial Websites; content extractor; entertainment Websites; feature extractor; information representation; informative content extraction; neural network; news Websites; wrapper induction; Computers; Feature extraction; HTML; Noise; Web pages; XML; Classes; DOM; DOM tree; HTML; XHTML Segmentation;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Computer and Communication Technology (ICCCT), 2013 4th International Conference on
  • Conference_Location
    Allahabad
  • Print_ISBN
    978-1-4799-1569-9
  • Type

    conf

  • DOI
    10.1109/ICCCT.2013.6749596
  • Filename
    6749596
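
The abstract describes segmenting a page by the class attribute of elements inside the BODY section and then separating informative content from noise. The record gives no algorithmic detail, so the following is only a minimal sketch of that general idea, not the authors' method. It assumes that text is attributed to the nearest enclosing element carrying a class attribute and that noise is identified by a caller-supplied set of class names. The names ClassSegmenter, extract_informative, and noise_classes are hypothetical and chosen for illustration.

```python
from collections import defaultdict
from html.parser import HTMLParser

# Elements that never have a closing tag, so they are not pushed on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}
# Elements whose text content is code or styling, not page content.
NON_CONTENT_TAGS = {"script", "style"}


class ClassSegmenter(HTMLParser):
    """Groups text inside <body> by the class of the nearest classed ancestor."""

    def __init__(self):
        super().__init__()
        self.stack = []                    # (tag, class or None) per open element
        self.in_body = False
        self.segments = defaultdict(list)  # class string -> list of text pieces

    def handle_starttag(self, tag, attrs):
        if tag == "body":
            self.in_body = True
        if tag in VOID_TAGS:
            return
        self.stack.append((tag, dict(attrs).get("class")))

    def handle_endtag(self, tag):
        if tag == "body":
            self.in_body = False
        # Pop back to the matching open tag, tolerating unclosed elements.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        text = data.strip()
        if not text or not self.in_body:
            return
        if self.stack and self.stack[-1][0] in NON_CONTENT_TAGS:
            return
        # Attribute the text to the nearest enclosing element with a class.
        for _, cls in reversed(self.stack):
            if cls:
                self.segments[cls].append(text)
                return
        self.segments[""].append(text)     # text with no classed ancestor


def extract_informative(html, noise_classes):
    """Return {class: joined text} for segments whose class is not noise.

    noise_classes is an assumption of this sketch: a hand-picked set of
    class attribute values (compared as whole strings) treated as noise.
    """
    parser = ClassSegmenter()
    parser.feed(html)
    parser.close()
    return {cls: " ".join(parts) for cls, parts in parser.segments.items()
            if cls not in noise_classes}


page = (
    '<html><head><title>T</title></head><body>'
    '<div class="ad">Buy now</div>'
    '<div class="article"><p>Main <b>story</b> text.</p><br></div>'
    '<div class="nav"><a href="#">Home</a></div>'
    '</body></html>'
)
result = extract_informative(page, noise_classes={"ad", "nav"})
print(result)
```

In this sketch the class attribute value is compared as a whole string, so an element with several space-separated classes forms its own segment. A real system would also need some way to decide which classes count as noise, which the abstract does not specify.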