Title :
Broadcast News Story Clustering via Term and Sentence Matching
Author :
Foong Kuin Yow ; Tien-Ping Tan
Author_Institution :
Sch. of Comput. Sci., Univ. Sci. Malaysia, Minden, Malaysia
Abstract :
In this paper, we propose a rule-based approach that uses term and sentence matching criteria to cluster Malay broadcast news into different stories. The proposed clustering method does not require users to predefine the number of clusters. The three main stages of the clustering are sentence segmentation, indexing, and term and sentence matching clustering. The sentences in the transcription are segmented before indexing. Indexing involves tokenization, stop word removal, stemming, term selection, and term representation. A vector space model (VSM) is used to represent the terms and sentences as vectors. The sentences are then grouped into clusters using term and sentence matching thresholds. The proposed approach achieves significantly better accuracy than the baseline approaches.
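The abstract does not spell out the matching rules, so the following is only a minimal sketch of how threshold-based clustering over a vector space model might look, not the authors' method. The stop-word list, the function names (`index_sentence`, `cosine`, `cluster_sentences`), the two thresholds (`min_shared_terms`, `sim_threshold`), and the greedy assignment strategy are all assumptions made for illustration; stemming and term selection are omitted.

```python
import math
import re
from collections import Counter

# Hypothetical, tiny Malay stop-word list, for illustration only.
STOP_WORDS = {"yang", "dan", "di", "ke", "itu", "ini", "akan", "telah"}


def index_sentence(sentence):
    """Tokenize, drop stop words, and return a term-frequency vector.

    Stemming and term selection are skipped in this sketch.
    """
    tokens = re.findall(r"[a-z]+", sentence.lower())
    return Counter(t for t in tokens if t not in STOP_WORDS)


def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def cluster_sentences(sentences, min_shared_terms=2, sim_threshold=0.2):
    """Greedily group sentences; the number of clusters is not fixed in advance.

    A sentence joins the most similar existing cluster only if it passes both
    an assumed term matching threshold (shared distinct terms) and an assumed
    sentence matching threshold (cosine similarity). Otherwise it starts a
    new cluster.
    """
    clusters = []  # each entry: {"vector": Counter, "members": [sentence indices]}
    for i, sentence in enumerate(sentences):
        vec = index_sentence(sentence)
        best, best_sim = None, -1.0
        for c in clusters:
            shared = len(vec.keys() & c["vector"].keys())
            sim = cosine(vec, c["vector"])
            if shared >= min_shared_terms and sim >= sim_threshold and sim > best_sim:
                best, best_sim = c, sim
        if best is None:
            clusters.append({"vector": Counter(vec), "members": [i]})
        else:
            best["vector"].update(vec)  # accumulate term counts for the cluster
            best["members"].append(i)
    return [c["members"] for c in clusters]


if __name__ == "__main__":
    # Made-up example sentences, not taken from the paper's data.
    demo = [
        "Banjir di Kelantan semakin buruk",
        "Mangsa banjir di Kelantan dipindahkan",
        "Harga minyak dunia naik",
        "Harga minyak naik lagi minggu ini",
    ]
    print(cluster_sentences(demo))
```

Because a new cluster is opened whenever no existing cluster passes both thresholds, the number of stories emerges from the data rather than being supplied by the user, which is the property the abstract emphasizes.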
Keywords :
indexing; information resources; pattern clustering; Malay broadcast news; VSM; broadcast news story clustering; rule-based approach; sentence matching; sentence segmentation; stemming; stop word removal; term matching; term representation; term selection; tokenization; vector space model; Accuracy; Algorithm design and analysis; Clustering algorithms; Principal component analysis; Speech; Vectors; broadcast news; clustering; transcription
Conference_Title :
Asian Language Processing (IALP), 2013 International Conference on
Conference_Location :
Urumqi
DOI :
10.1109/IALP.2013.62