Algorithms of mining intact record from isomorphic Web page

Author

Qiu, Yong ; Lan, Yong-Jie

Author_Institution

Sch. of Inf. & Electron. Eng., Shanghai Inst. of Bus. & Technol., China

Volume

4

fYear

2005

fDate

18-21 Aug. 2005

Firstpage

2373

Abstract

The huge amount of information available on the Web has motivated much research into tools for extracting data from Web pages. Many Web pages are generated automatically from an underlying database, so their HTML structure is fairly specific and regular. Existing algorithms such as OMINI and MDR can extract information from multi-record Web pages; their key step is to identify the repetitive record structure automatically. However, Web pages that contain multiple records are actually directory pages, and the information in a directory page is not intact; the intact information resides in lower-level Web pages, called detailed pages. A detailed page holds the information of only one record, so it cannot be extracted with a duplicated-record-finding algorithm. To solve this problem of extracting intact information from the Web, the concept of an isomorphic Web page is introduced and two algorithms are proposed: one for finding a directory that has isomorphic Web pages, and the other for mining record information from isomorphic Web pages.
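
The idea of mining a record from isomorphic pages can be illustrated with a minimal sketch. It is not the authors' algorithm: the abstract does not define isomorphism, so the sketch assumes that two pages are isomorphic when their text nodes occur under the same sequence of tag paths, and that record fields are the text nodes whose content differs between pages. The names `PathCollector`, `text_nodes`, `is_isomorphic`, and `mine_records` are hypothetical.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag; they are not pushed on the path stack.
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}


class PathCollector(HTMLParser):
    """Collect (tag_path, text) pairs for every non-empty text node."""

    def __init__(self):
        super().__init__()
        self.stack = []   # currently open tags, outermost first
        self.items = []   # (path, text) in document order

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed inner tags.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.items.append(("/".join(self.stack), text))


def text_nodes(html):
    """Return the (tag_path, text) pairs of one page."""
    parser = PathCollector()
    parser.feed(html)
    parser.close()
    return parser.items


def is_isomorphic(page_a, page_b):
    """Assumed definition: identical sequences of text-node tag paths."""
    paths_a = [path for path, _ in text_nodes(page_a)]
    paths_b = [path for path, _ in text_nodes(page_b)]
    return paths_a == paths_b


def mine_records(pages):
    """Return, per page, the text nodes whose content varies across pages."""
    nodes = [text_nodes(page) for page in pages]
    first_paths = [path for path, _ in nodes[0]]
    if any([path for path, _ in n] != first_paths for n in nodes):
        raise ValueError("pages are not isomorphic under the assumed definition")
    # Template text is identical on every page; record fields differ.
    varying = [i for i in range(len(first_paths))
               if len({n[i][1] for n in nodes}) > 1]
    return [[n[i][1] for i in varying] for n in nodes]


# Two hypothetical detailed pages generated from the same template.
page_a = ("<html><body><h1>Book</h1><table>"
          "<tr><td>Title:</td><td>Dune</td></tr>"
          "<tr><td>Price:</td><td>$9</td></tr></table></body></html>")
page_b = page_a.replace("Dune", "Emma").replace("$9", "$7")

if is_isomorphic(page_a, page_b):
    print(mine_records([page_a, page_b]))
```

Under these assumptions the shared labels ("Book", "Title:", "Price:") are treated as template text and dropped, while the values that differ from page to page are kept as the record. A field that happens to hold the same value on every sampled page would be misclassified as template text, which is one reason this comparison is only a simplification.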

Keywords

Internet; data mining; hypermedia markup languages; information retrieval; HTML; detailed page; directory page; duplicated record finding algorithm; isomorphic Web page; Data engineering; Data mining; Databases; Electronic mail; HTML; Local area networks; Machine learning; Software systems; Web mining; Web pages; Information Extracting; WEB; WEB mining; isomorphic webpage; webpage;

fLanguage

English

Publisher

IEEE

Conference_Titel

Machine Learning and Cybernetics, 2005. Proceedings of 2005 International Conference on

Conference_Location

Guangzhou, China

Print_ISBN

0-7803-9091-1

Type

conf

DOI

10.1109/ICMLC.2005.1527341

Filename

1527341