Web document information extraction using class attribute approach

Author

Srivastava, Sanjeev ; Haroon, Mohd ; Bajaj, Anu

Author_Institution

CSE Deptt. IET, Dr. R.M.L. Avadh Univ., Faizabad, India

fYear

2013

fDate

20-22 Sept. 2013

Firstpage

17

Lastpage

22

Abstract

Change is in the nature of things, and in information technology it happens rapidly. As new technologies keep changing how information is represented, finding the relevant pieces of information on a web page becomes difficult because of heavy noise: the page is cluttered with distracting features (advertisements, links, scrollers, etc.). Extracting information or useful content from web pages (structured or semi-structured) has therefore become a critical issue for web users and web miners. Noise on a web page can mislead the user, so information extraction from web pages is of great importance. A difficult problem in information extraction is defining the noise domain and removing it. Recent studies have proposed wrapper induction, feature extractors, the backpropagation algorithm for neural networks, content extractors, PAT trees, etc. In this paper we investigate DOM tree segmentation with a class-attribute-based approach. The class attribute can be used with all HTML elements inside the 'BODY' section of the document. It is used to create different classes of an element, where each class can have its own properties. To evaluate system performance, several experiments were carried out on different commercial, news, and entertainment websites. The experiments indicate that our method is applicable to extracting informative content from the web pages of these websites.
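
The idea of segmenting the BODY by class attribute can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not say how noise classes are identified, so the noise list below, the sample page, and the names `ClassSegmenter` and `extract_by_class` are all assumptions made for illustration. The sketch uses only Python's standard `html.parser`, groups each piece of body text under the class of its nearest enclosing element, and drops text that sits under any element whose class is on the assumed noise list.

```python
from collections import defaultdict
from html.parser import HTMLParser

# Hypothetical noise class names, chosen for illustration only.
NOISE_CLASSES = frozenset({"ad", "nav", "sidebar", "footer"})
# Elements that never have an end tag, so they are not pushed on the stack.
VOID_TAGS = frozenset({"area", "base", "br", "col", "embed", "hr", "img",
                       "input", "link", "meta", "source", "track", "wbr"})
# Elements whose text is never page content.
SKIP_TAGS = frozenset({"script", "style"})


class ClassSegmenter(HTMLParser):
    """Group text inside <body> by the class of the nearest classed ancestor."""

    def __init__(self, noise_classes=NOISE_CLASSES):
        super().__init__()
        self.noise_classes = noise_classes
        self.stack = []                     # open elements: (tag, classes)
        self.in_body = False
        self.segments = defaultdict(list)   # class string -> list of texts

    def handle_starttag(self, tag, attrs):
        if tag == "body":
            self.in_body = True
        if tag in VOID_TAGS:
            return
        value = dict(attrs).get("class") or ""
        self.stack.append((tag, tuple(value.split())))

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; tolerate unclosed children.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break
        if tag == "body":
            self.in_body = False

    def handle_data(self, data):
        text = data.strip()
        if not text or not self.in_body:
            return
        if any(tag in SKIP_TAGS for tag, _ in self.stack):
            return
        # Drop text that lies under any element carrying a noise class.
        if any(c in self.noise_classes
               for _, classes in self.stack for c in classes):
            return
        key = "(no class)"
        for _, classes in reversed(self.stack):
            if classes:
                key = " ".join(classes)
                break
        self.segments[key].append(text)


def extract_by_class(html, noise_classes=NOISE_CLASSES):
    """Return {class string: [text, ...]} for the non-noise body text."""
    parser = ClassSegmenter(noise_classes)
    parser.feed(html)
    parser.close()
    return dict(parser.segments)


# Made-up sample page, not taken from the paper's experiments.
SAMPLE_HTML = """
<html><head><title>T</title></head>
<body>
<div class="nav"><a href="/">Home</a></div>
<div class="article"><h1>Title</h1><p>First paragraph.</p><br>
<p class="ad">Buy now</p></div>
<div class="footer">Copyright</div>
</body></html>
"""

if __name__ == "__main__":
    print(extract_by_class(SAMPLE_HTML))
```

On the sample page only the text under the `article` class survives. A fixed noise list is the simplest possible choice; a real system would more plausibly derive noise classes from the pages themselves.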

Keywords

Web sites; data mining; hypermedia markup languages; information retrieval; tree data structures; DOM tree segmentation; HTML elements; PAT trees; Web document information extraction; Web miners; Web page content extraction; Web users; backpropagation algorithm; class attribute approach; commercial Websites; content extractor; entertainment Websites; feature extractor; information representation; informative content extraction; neural network; news Websites; wrapper induction; Computers; Feature extraction; HTML; Noise; Web pages; XML; Classes; DOM; DOM tree; XHTML Segmentation

fLanguage

English

Publisher

IEEE

Conference_Titel

Computer and Communication Technology (ICCCT), 2013 4th International Conference on

Conference_Location

Allahabad

Print_ISBN

978-1-4799-1569-9

Type

conf

DOI

10.1109/ICCCT.2013.6749596

Filename

6749596