DocumentCode :
1938616
Title :
Mining Collective Pair Data from the Web
Author :
Fan, Cong ; Jiang, Long ; Zhou, Ming ; Wang, Shi-Long
Author_Institution :
Chongqing Univ., Chongqing
Volume :
7
fYear :
2007
fDate :
19-22 Aug. 2007
Firstpage :
3997
Lastpage :
4002
Abstract :
Pair data is data that consists of two correlated components. A book title and its author, a product name and its price, a bilingual translation term pair, and a Chinese couplet (a unit of verse consisting of two successive lines) are examples of this type of data. In this paper, based on the observation that pair data tend to co-occur in the same block of the same Web page following similar patterns, we propose a new approach to extracting collective pair data. A recursive process is used to extract collective pair data from the Web. An automatic algorithm based on a data structure called a PAT tree first discovers all repeated patterns; these patterns are then ranked with a ranking SVM to obtain trustworthy pair data extraction patterns. Finally, the patterns are transformed using predefined surface pattern classes and applied to extract collective pair data. Experimental results demonstrate that our model achieves higher extraction precision and recall than a previous approach.
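The repeated-pattern step described in the abstract can be illustrated with a minimal sketch. This is not the authors' algorithm: a plain n-gram counter stands in for the PAT tree, and a simple length-and-frequency sort stands in for the ranking SVM. The function name `repeated_patterns`, its parameters, and the tokenized sample block are all hypothetical and chosen only for illustration.

```python
from collections import Counter


def repeated_patterns(tokens, min_len=2, max_len=4, min_count=2):
    """Return token n-grams that occur at least min_count times.

    A plain n-gram counter is used as a simple stand-in for the
    PAT tree named in the abstract; it is not the authors' method.
    """
    counts = Counter()
    for n in range(min_len, max_len + 1):
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return {gram: c for gram, c in counts.items() if c >= min_count}


# Hypothetical block of a Web page listing book titles and authors,
# with the two pair components replaced by placeholder tokens.
block = ["<li>", "TITLE", "by", "AUTHOR", "</li>",
         "<li>", "TITLE", "by", "AUTHOR", "</li>",
         "<li>", "TITLE", "by", "AUTHOR", "</li>"]

patterns = repeated_patterns(block)

# Longer, more frequent patterns first: a crude, assumed substitute
# for the ranking SVM mentioned in the abstract.
ranked = sorted(patterns.items(), key=lambda kv: (-len(kv[0]), -kv[1]))
```

In this toy block, a pattern such as `TITLE by AUTHOR` repeats once per list item, which is the kind of regularity the abstract says pair data tend to follow within one block of a page.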
Keywords :
Internet; data mining; information retrieval; support vector machines; PAT tree; World Wide Web; collective pair data extraction; collective pair data mining; ranking SVM; repeated pattern discovery; Asia; Books; Cybernetics; Data mining; Electronic mail; Machine learning; Mechanical engineering; Software engineering; Support vector machines; Web pages; Pattern discovery; Ranking SVM; Web mining;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Machine Learning and Cybernetics, 2007 International Conference on
Conference_Location :
Hong Kong
Print_ISBN :
978-1-4244-0973-0
Electronic_ISBN :
978-1-4244-0973-0
Type :
conf
DOI :
10.1109/ICMLC.2007.4370845
Filename :
4370845