Regional Information Center for Science and Technology

DocumentCode :

1938616

Title :

Mining Collective Pair Data from the Web

Author :

Fan, Cong ; Jiang, Long ; Zhou, Ming ; Wang, Shi-Long

Author_Institution :

Chongqing Univ., Chongqing

Volume :

Year :

2007

Date :

19-22 Aug. 2007

Firstpage :

3997

Lastpage :

4002

Abstract :

Pair data consists of two correlated data components. Examples include a book title and its author, a product name and its price, a term and its bilingual translation, and a Chinese couplet (a unit of verse consisting of two successive lines). Based on the observation that pair data tend to co-occur in the same block of the same Web page following similar patterns, this paper proposes a new approach that uses a recursive process to extract collective pair data from the Web. An automatic algorithm based on a data structure called the PAT tree first discovers all repeated patterns; these patterns are then ranked with a ranking SVM to obtain trustworthy pair data extraction patterns. Finally, the patterns are transformed using predefined surface pattern classes and applied to extract collective pair data. Experimental results demonstrate that the model achieves higher extraction precision and recall than the previous approach.
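
The repeated-pattern discovery step described in the abstract can be illustrated with a minimal sketch. This is not the authors' algorithm: it swaps the PAT tree for plain n-gram counting and replaces the ranking SVM with a simple coverage heuristic (pattern length times frequency). The function name `repeated_patterns`, its parameters, and the tokenized block are all hypothetical, chosen only for illustration.

```python
from collections import Counter


def repeated_patterns(tokens, min_len=2, max_len=5, min_count=2):
    """Count contiguous token n-grams and keep those seen at least min_count times.

    Assumption: brute-force n-gram counting stands in for the PAT tree
    mentioned in the abstract; it finds the same kind of repeated
    substrings, only less efficiently.
    """
    counts = Counter()
    n = len(tokens)
    for length in range(min_len, max_len + 1):
        for start in range(n - length + 1):
            counts[tuple(tokens[start:start + length])] += 1
    return {pattern: c for pattern, c in counts.items() if c >= min_count}


# Hypothetical tokenized block of a Web page: three list items, each
# holding a pair of text fields separated by a dash.
tokens = [
    "<li>", "TEXT", "-", "TEXT", "</li>",
    "<li>", "TEXT", "-", "TEXT", "</li>",
    "<li>", "TEXT", "-", "TEXT", "</li>",
]

patterns = repeated_patterns(tokens)

# Assumption: a coverage score (length * frequency) stands in for the
# ranking SVM; ties are broken in favor of the longer pattern.
best = max(patterns, key=lambda p: (len(p) * patterns[p], len(p)))
print(best, patterns[best])
```

In this toy block the highest-scoring pattern is the full list-item template, whose two `TEXT` slots would hold the two components of each pair. A real system would need the learned ranking and the surface pattern classes that the abstract mentions.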

Keywords :

Internet; data mining; information retrieval; support vector machines; PAT tree; World Wide Web; collective pair data extraction; collective pair data mining; ranking SVM; repeated pattern discovery; Asia; Books; Cybernetics; Electronic mail; Machine learning; Mechanical engineering; Software engineering; Web pages; Pattern discovery; Web mining

Language :

English

Publisher :

IEEE

Conference_Title :

Machine Learning and Cybernetics, 2007 International Conference on

Conference_Location :

Hong Kong

Print_ISBN :

978-1-4244-0973-0

Electronic_ISBN :

978-1-4244-0973-0

Type :

conf

DOI :

10.1109/ICMLC.2007.4370845

Filename :

4370845

Link To Document :

https://search.ricest.ac.ir/dl/search/defaultta.aspx?DTC=49&DC=1938616