Title :
Spoken document retrieval using both word-based and syllable-based document spaces with latent semantic indexing
Author :
Ichikawa, Kazuhisa ; Tsuge, Satoru ; Kitaoka, Norihide ; Takeda, Kenji ; Kita, Kahori
Author_Institution :
Nagoya Univ., Nagoya, Japan
fDate :
Oct. 29 2013-Nov. 1 2013
Abstract :
In this paper, we propose a spoken document retrieval method that uses vector space models in multiple document spaces. First, we construct multiple document vector spaces, one based on continuous-word speech recognition results and the other on continuous-syllable speech recognition results. Query expansion is also applied to the word-based document space. We propose applying latent semantic indexing (LSI) not only to the word-based space but also to the syllable-based space, reducing the dimensionality of each space by means of implicitly defined semantics. Finally, we compute the distance between the query and each available document in the various spaces and combine these distances to rank the documents. In this procedure, we propose modeling each document as a hyperplane. To evaluate the proposed method, we conducted spoken document retrieval experiments on the NTCIR-9 SpokenDoc data set. The results showed that combining the distances and applying LSI to the syllable-based document space improved retrieval performance.
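The pipeline outlined in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes NumPy, toy term-by-document count matrices (`W` for words, `S` for syllables), cosine distance, and a weighted sum with a hypothetical weight `alpha` as the combination rule, none of which the abstract specifies. Query expansion and the hyperplane document model are omitted. The function names `lsi_fit`, `cosine_distances`, and `rank_documents` are illustrative only.

```python
import numpy as np

def lsi_fit(term_doc, k):
    """Rank-k LSI of a term-by-document matrix.

    Returns the term basis U_k and one latent vector per document (as rows).
    """
    U, s, Vt = np.linalg.svd(term_doc, full_matrices=False)
    U_k = U[:, :k]
    doc_vecs = (s[:k, None] * Vt[:k]).T  # same as (U_k.T @ term_doc).T
    return U_k, doc_vecs

def cosine_distances(query_vec, doc_vecs):
    """Cosine distance (1 - cosine similarity) from the query to each document."""
    q_norm = np.linalg.norm(query_vec)
    d_norms = np.linalg.norm(doc_vecs, axis=1)
    denom = np.maximum(q_norm * d_norms, 1e-12)  # guard against zero vectors
    return 1.0 - (doc_vecs @ query_vec) / denom

def rank_documents(q_word, q_syll, word_docs, syll_docs, k=2, alpha=0.5):
    """Rank documents by a weighted sum of word-space and syllable-space distances.

    The weighted sum and the value of alpha are assumptions made for this sketch.
    """
    U_w, docs_w = lsi_fit(word_docs, k)
    U_s, docs_s = lsi_fit(syll_docs, k)
    # Fold the query into each latent space with the same basis as the documents.
    d_word = cosine_distances(U_w.T @ q_word, docs_w)
    d_syll = cosine_distances(U_s.T @ q_syll, docs_s)
    combined = alpha * d_word + (1.0 - alpha) * d_syll
    return np.argsort(combined), combined

# Toy data: rows are terms (words or syllables), columns are three documents.
W = np.array([[2, 0, 1],
              [1, 0, 0],
              [0, 3, 1],
              [0, 1, 2]], dtype=float)
S = np.array([[3, 0, 1],
              [2, 0, 1],
              [0, 4, 1],
              [0, 2, 2],
              [1, 1, 0]], dtype=float)

# Use document 0 itself as the query in both spaces.
order, dist = rank_documents(W[:, 0], S[:, 0], W, S, k=2, alpha=0.5)
print("ranking:", order)
print("combined distances:", dist)
```

Because the query here is a copy of document 0, it folds onto that document's latent vector in both spaces, so document 0 is ranked first. In practice the weight between the two spaces would need to be tuned on held-out queries.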
Keywords :
document handling; indexing; information retrieval; speech recognition; LSI; NTCIR-9 SpokenDoc data set; continuous syllable speech recognition; continuous word speech recognition; hyperplane; latent semantic indexing; multiple document vector spaces; query expansion; spoken document retrieval performance; syllable based document spaces; vector space models; word based document spaces; Indexes; Large scale integration; Semantics; Speech; Speech recognition; Vectors; Web pages;
Conference_Title :
Signal and Information Processing Association Annual Summit and Conference (APSIPA), 2013 Asia-Pacific
Conference_Location :
Kaohsiung
DOI :
10.1109/APSIPA.2013.6694119