DocumentCode
3401909
Title
Web document information extraction using class attribute approach
Author
Srivastava, Sanjeev ; Haroon, Mohd ; Bajaj, Anu
Author_Institution
CSE Deptt. IET, Dr. R.M.L. Avadh Univ., Faizabad, India
fYear
2013
fDate
20-22 Sept. 2013
Firstpage
17
Lastpage
22
Abstract
Change is the nature of things, and in information technology change happens rapidly. As new technologies keep changing how information is represented, finding the relevant pieces of information in a web page becomes difficult because of heavy noise and clutter from distracting features (such as advertisements, links, and scrollers). Extracting information or useful content from web pages (structured or semi-structured) has become a critical issue for web users and web miners. Noise on a web page can mislead the user, so information extraction from web pages is of great importance. A central difficulty in information extraction is defining the noise domain and removing it. Recent studies have made techniques such as wrapper induction, feature extractors, the backpropagation algorithm for neural networks, content extractors, and PAT trees well known. In this paper we investigate DOM tree segmentation with a class-attribute-based approach. The class attribute can be used with all HTML elements inside the 'BODY' section of the document. It is used to create different classes of an element, where each class can have its own properties. To evaluate system performance, several experiments were conducted on different commercial, news, and entertainment websites. The experiments indicate that our method is applicable to extracting informative content from the web pages of these websites.
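The abstract does not spell out the segmentation procedure, so the following is only a minimal sketch of the general idea of grouping 'BODY' text by class attribute, not the authors' method. It assumes that text belongs to the nearest enclosing element that carries a class, and that any text under an element with a "noise" class is discarded. The names `ClassSegmenter`, `extract_informative`, and the contents of `NOISE_CLASSES` are hypothetical choices made for illustration.

```python
from collections import defaultdict
from html.parser import HTMLParser

# Void elements never get a closing tag, so they are not pushed on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}

# Hypothetical noise classes (assumption); a real system would configure or learn these.
NOISE_CLASSES = {"ad", "advert", "nav", "sidebar", "footer"}


class ClassSegmenter(HTMLParser):
    """Group text inside <body> by the class of the nearest enclosing element."""

    def __init__(self):
        super().__init__()
        self.stack = []                    # (tag, classes) for each open element
        self.in_body = False
        self.segments = defaultdict(list)  # class string -> text fragments

    def handle_starttag(self, tag, attrs):
        if tag == "body":
            self.in_body = True
        if tag in VOID_TAGS:
            return
        classes = tuple((dict(attrs).get("class") or "").split())
        self.stack.append((tag, classes))

    def handle_endtag(self, tag):
        if tag == "body":
            self.in_body = False
        # Pop back to the matching open tag, tolerating unclosed elements.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        text = data.strip()
        if not self.in_body or not text:
            return
        # Ignore script and style content.
        if any(tag in ("script", "style") for tag, _ in self.stack):
            return
        # Drop text that sits anywhere under a noise-classed element.
        if any(NOISE_CLASSES.intersection(classes) for _, classes in self.stack):
            return
        # Attribute the text to the nearest ancestor that has a class.
        key = "(no class)"
        for _, classes in reversed(self.stack):
            if classes:
                key = " ".join(classes)
                break
        self.segments[key].append(text)


def extract_informative(html):
    """Return a dict mapping class string to the joined text found under it."""
    parser = ClassSegmenter()
    parser.feed(html)
    parser.close()
    return {key: " ".join(parts) for key, parts in parser.segments.items()}
```

Calling `extract_informative(page_source)` on a page returns one text segment per class name, with segments under the listed noise classes omitted. Matching noise by fixed class names is the simplest possible policy and is chosen here only to keep the sketch short.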
Keywords
Web sites; data mining; hypermedia markup languages; information retrieval; tree data structures; DOM tree segmentation; HTML elements; PAT trees; Web document information extraction; Web miners; Web page content extraction; Web users; backpropagation algorithm; class attribute approach; commercial Websites; content extractor; entertainment Websites; feature extractor; information representation; informative content extraction; neural network; news Websites; wrapper induction; Computers; Feature extraction; HTML; Noise; Web pages; XML; Classes; DOM; DOM tree; HTML; XHTML Segmentation;
fLanguage
English
Publisher
IEEE
Conference_Titel
Computer and Communication Technology (ICCCT), 2013 4th International Conference on
Conference_Location
Allahabad
Print_ISBN
978-1-4799-1569-9
Type
conf
DOI
10.1109/ICCCT.2013.6749596
Filename
6749596