DocumentCode :
2733160
Title :
An Intelligent Web Agent to Mine Bilingual Parallel Pages via Automatic Discovery of URL Pairing Patterns
Author :
Kit, Chunyu ; Ng, Jessica Yee Ha
Author_Institution :
City Univ. of Hong Kong, Hong Kong
fYear :
2007
fDate :
5-12 Nov. 2007
Firstpage :
526
Lastpage :
529
Abstract :
This paper describes an intelligent agent that facilitates bi-text mining from the Web by automatically discovering URL pairing patterns (or keys) for retrieving parallel Web pages. The linking power of a key, defined as the number of URL pairs that it can match, serves as the objective function in the search for the best set of keys, that is, the set that finds the greatest number of Web page pairs within a bilingual Web site. Our experiments show that a best-first search approximating this optimization with an empirical threshold can recognize 98.1% of true parallel Web pages and discover many irregular pairing patterns that other approaches are unlikely to find. It does so with no prior knowledge such as ad hoc heuristics, no labelled data for training, and no similarity analysis of Web page structure and content, all of which are commonly involved in existing approaches.
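The idea of ranking keys by linking power can be illustrated with a minimal sketch. The abstract does not specify how keys are represented or how the best-first search is organised, so everything below is an assumption made for illustration: a key is taken to be the pair of substrings left over after stripping the common prefix and suffix of two URLs, a simple greedy loop stands in for the best-first search, and the names `candidate_key`, `mine_pairs` and `threshold` are hypothetical.

```python
from itertools import combinations


def candidate_key(url_a, url_b):
    """Return the differing middle parts of two distinct URLs as a key.

    Assumption: a key is the (left, right) substring pair that remains
    once the longest common prefix and suffix are stripped.
    """
    shorter = min(len(url_a), len(url_b))
    # Length of the longest common prefix.
    p = 0
    while p < shorter and url_a[p] == url_b[p]:
        p += 1
    # Length of the longest common suffix that does not overlap the prefix.
    s = 0
    while (s < shorter - p
           and url_a[len(url_a) - 1 - s] == url_b[len(url_b) - 1 - s]):
        s += 1
    return url_a[p:len(url_a) - s], url_b[p:len(url_b) - s]


def mine_pairs(urls, threshold=2):
    """Greedily select keys by linking power (a stand-in for best-first search).

    Linking power of a key = number of URL pairs it matches among the
    URLs not yet paired. Stops when the best key falls below `threshold`.
    """
    remaining = sorted(set(urls))  # sorted so each key has a fixed direction
    selected = []                  # list of (key, [(url_a, url_b), ...])
    while True:
        matches = {}
        for a, b in combinations(remaining, 2):
            matches.setdefault(candidate_key(a, b), []).append((a, b))
        if not matches:
            break
        best_key, best_pairs = max(matches.items(), key=lambda kv: len(kv[1]))
        if len(best_pairs) < threshold:
            break
        # Keep each URL in at most one pair for the chosen key.
        used, kept = set(), []
        for a, b in best_pairs:
            if a not in used and b not in used:
                kept.append((a, b))
                used.update((a, b))
        selected.append((best_key, kept))
        remaining = [u for u in remaining if u not in used]
    return selected
```

Calling `mine_pairs` on a small site whose pages live under `/en/` and `/zh/` would be expected to rank the key `("en", "zh")` first, since it links more URL pairs than incidental keys that relate two pages in the same language. The quadratic pairwise comparison is acceptable only for a sketch; a real crawler would need a cheaper way to enumerate candidate keys.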
Keywords :
Web sites; data mining; natural language processing; software agents; text analysis; World Wide Web; automatic URL pairing pattern discovery; bi-text mining; bilingual Web site; bilingual parallel page mining; intelligent Web agent; parallel Web page retrieval; Conferences; Intelligent agent; Joining processes; Natural language processing; Pattern analysis; Pattern matching; Pattern recognition; Uniform resource locators; Web mining; Web pages; parallel web pages; URL pairing pattern; bitext mining
fLanguage :
English
Publisher :
IEEE
Conference_Title :
Web Intelligence and Intelligent Agent Technology Workshops, 2007 IEEE/WIC/ACM International Conferences on
Conference_Location :
Silicon Valley, CA
Print_ISBN :
0-7695-3028-1
Type :
conf
DOI :
10.1109/WI-IATW.2007.107
Filename :
4427643