DocumentCode
2781257
Title
Chinese query expansion based on user log clustering
Author
Jia, Shufang ; Li, Lei
Author_Institution
Center for Intell. Sci. & Technol., Beijing Univ. of Posts & Telecommun., Beijing, China
fYear
2009
fDate
6-8 Nov. 2009
Firstpage
446
Lastpage
451
Abstract
Most previous research on query expansion is based on pseudo-relevant documents. In this study, we present a novel expansion method that clusters a real user log. Because not all clicked pages are suitable for query expansion, we de-noise the clicked results by reliability to improve performance. After the HTML tags are removed, the page body contents are clustered, and the cluster centers cover various aspects of the original query. The terms used in log queries can provide a better choice of features, from the user's point of view, for summarizing the Web pages that were clicked from these queries. Therefore, the associated queries, reverse queries, Web page titles and keyword phrases are combined with the cluster centers to obtain high-quality expansion terms for new queries. We also propose a new terminology extraction method that uses Baidu Baike. It identifies and extracts terminology phrases based on this manually edited online dictionary.
Keywords
Web sites; data mining; hypermedia markup languages; query processing; Baidu Baike; Chinese query expansion; HTML labels removal; Web page denoising; keyword phrases; manual edited online dictionary; page body contents; pseudo relevant documents; terminology phrase extraction; terminology phrase identification; user log clustering; Computer science; Data mining; Dictionaries; HTML; Information retrieval; Large scale integration; Noise reduction; Search engines; Terminology; Web pages; Baike terminology extraction; LSI clustering; Query expansion; log mining; webpage de-noising;
fLanguage
English
Publisher
IEEE
Conference_Titel
Network Infrastructure and Digital Content, 2009. IC-NIDC 2009. IEEE International Conference on
Conference_Location
Beijing
Print_ISBN
978-1-4244-4898-2
Electronic_ISBN
978-1-4244-4900-6
Type
conf
DOI
10.1109/ICNIDC.2009.5360836
Filename
5360836