DocumentCode
441867
Title
Algorithms of mining intact record from isomorphic Web page
Author
Qiu, Yong ; Lan, Yong-Jie
Author_Institution
Sch. of Inf. & Electron. Eng., Shanghai Inst. of Bus. & Technol., China
Volume
4
fYear
2005
fDate
18-21 Aug. 2005
Firstpage
2373
Abstract
The huge amount of information available on the Web has attracted many research efforts into developing tools to extract data from Web pages. Many Web pages are generated automatically from an underlying database; therefore, the HTML structure of these pages is fairly specific and regular. Some existing algorithms, such as OMINI and MDR, can extract information from multi-record Web pages; their main idea is to identify the repetitive record structure automatically. However, Web pages that contain multiple records are actually directory pages, and the information in a directory page is not intact; the intact information resides in lower-level Web pages, called detailed pages. A detailed page holds the information of only one record, so it cannot be extracted using a duplicated-record-finding algorithm. To solve this problem of extracting intact information from the Web, the concept of an isomorphic Web page is introduced, and two algorithms are proposed: one for finding directories that contain isomorphic Web pages, and the other for mining record information from isomorphic Web pages.
Keywords
Internet; data mining; hypermedia markup languages; information retrieval; HTML; detailed page; directory page; duplicated record finding algorithm; isomorphic Web page; Data engineering; Data mining; Databases; Electronic mail; HTML; Local area networks; Machine learning; Software systems; Web mining; Web pages; Information Extracting; WEB; WEB mining; isomorphic webpage; webpage;
fLanguage
English
Publisher
IEEE
Conference_Titel
Machine Learning and Cybernetics, 2005. Proceedings of 2005 International Conference on
Conference_Location
Guangzhou, China
Print_ISBN
0-7803-9091-1
Type
conf
DOI
10.1109/ICMLC.2005.1527341
Filename
1527341