Title :
Domain-independent topic segmentation using a string kernel on recognized sub-word sequences
Author :
Sadohara, K. ; Lee, S.-w. ; Kojima, H.
Author_Institution :
Nat. Inst. of Adv. Ind. Sci. & Technol. (AIST), Tsukuba
Abstract :
The goal of the present paper is to explore the feasibility of a topic segmentation method without using large vocabulary continuous speech recognition (LVCSR). The proposed method is domain-independent in the sense that it is not constrained by vocabulary and does not require training data. For a sequence of sub-word units obtained using a continuous sub-word recognizer, the proposed method merges similar adjacent parts of the sequence in an agglomerative manner to produce a hierarchical cluster tree. The proposed method uses a string kernel to efficiently compute the similarity between two strings of sub-word units based on the frequencies of any sub-strings appearing in the strings. By carefully excluding the influence of the sub-strings that are irrelevant to the topic of interest, topically coherent clusters are formed without linguistic knowledge. An empirical study on a Japanese news speech corpus shows that the method performs better than a topic segmenter using LVCSR.
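The procedure outlined in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the bounded sub-string length (max_len), the cosine normalization, the fixed-size input chunks, and all function names are assumptions made for illustration, and the step that excludes topic-irrelevant sub-strings is omitted because the abstract does not specify it. The sketch also counts sub-strings explicitly rather than using an efficient kernel computation.

```python
from collections import Counter
from math import sqrt


def substring_counts(units, max_len=3):
    """Count every contiguous sub-string of up to max_len units (assumed bound)."""
    counts = Counter()
    for n in range(1, max_len + 1):
        for i in range(len(units) - n + 1):
            counts[tuple(units[i:i + n])] += 1
    return counts


def kernel(a, b):
    """Inner product of two sub-string count vectors."""
    if len(a) > len(b):
        a, b = b, a  # iterate over the smaller vector
    return sum(c * b[s] for s, c in a.items())


def similarity(a, b):
    """Cosine-normalized kernel, so that segment length does not dominate."""
    denom = sqrt(kernel(a, a) * kernel(b, b))
    return kernel(a, b) / denom if denom else 0.0


def segment_tree(chunks, max_len=3):
    """Repeatedly merge the most similar pair of ADJACENT segments.

    Returns the merge history as (left_span, right_span) pairs, where a
    span is (first_chunk_index, last_chunk_index), inclusive. The history
    read bottom-up is the hierarchical cluster tree.
    """
    segs = [((i, i), substring_counts(c, max_len)) for i, c in enumerate(chunks)]
    merges = []
    while len(segs) > 1:
        best = max(range(len(segs) - 1),
                   key=lambda i: similarity(segs[i][1], segs[i + 1][1]))
        (ls, lc), (rs, rc) = segs[best], segs[best + 1]
        merges.append((ls, rs))
        # Simplification: counts are summed, so sub-strings that straddle
        # the boundary between the two merged segments are not counted.
        segs[best:best + 2] = [((ls[0], rs[1]), lc + rc)]
    return merges


# Toy input: each chunk is a string whose characters stand in for sub-word units.
chunks = ["abab", "abba", "xyxy", "yxyx"]
merges = segment_tree(chunks)
```

On this toy input the last merge joins the span (0, 1) with the span (2, 3), so cutting the tree at the top level places the single boundary between the second and third chunks, where the set of units changes.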
Keywords :
pattern clustering; string matching; vocabulary; word processing; Japanese news speech corpus; continuous sub-word recognizer; domain-independent topic segmentation; hierarchical cluster tree; recognized sub-word sequences; string kernel; Clustering algorithms; Clustering methods; Frequency; Kernel; Paper technology; Speech analysis; Speech recognition; Training data; Unsupervised learning
Conference_Title :
Spoken Language Technology Workshop, 2006. IEEE
Conference_Location :
Palm Beach
Print_ISBN :
1-4244-0872-5
DOI :
10.1109/SLT.2006.326809