Title :
False hits of tri-syllabic queries in a Chinese signature file
Author :
Liang, Tyne ; Lee, Suh-Yin ; Yang, Wei-Pang
Author_Institution :
Inst. of Comput. Sci. & Inf. Eng., Nat. Chiao Tung Univ., Hsinchu, Taiwan
Abstract :
In applying the superimposed coding method to character-based Chinese text retrieval, we find two kinds of false hits for a multi-syllabic (multicharacter) query. The first is a random false hit (RFH), which is caused by irrelevant characters accidentally setting the query's bits in a document signature. The second is an adjacency false hit (AFH), which is caused by the loss of character sequence information when the signature is created. Since many query terms are proper nouns and Chinese names, which often contain three characters (tri-syllabic), we derive a formula to estimate the RFH for tri-syllabic queries. Because the AFH cannot be reduced by the single-character (monogram) hashing method, we design a method that hashes consecutive character pairs (bigrams) to reduce both the AFH and the RFH. We find that there exists an optimal weight assignment that minimizes the false hit rate in a combined scheme which encodes both monogram and bigram keys in document signatures.
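The two kinds of false hit described above can be illustrated with a minimal sketch of superimposed coding. Everything specific here is an assumption made for illustration and is not taken from the paper: the SHA-256-based bit selection, the signature length `sig_bits`, the per-key weights `mono_weight` and `bi_weight`, and the helper names `key_bits`, `signature`, and `matches`.

```python
import hashlib


def key_bits(key, sig_bits, weight):
    """Map a key to `weight` distinct bit positions (hypothetical hash scheme)."""
    positions = set()
    counter = 0
    while len(positions) < weight:
        digest = hashlib.sha256(f"{key}|{counter}".encode("utf-8")).digest()
        positions.add(int.from_bytes(digest[:8], "big") % sig_bits)
        counter += 1
    return positions


def signature(text, sig_bits=256, mono_weight=3, bi_weight=3, use_bigrams=False):
    """Superimpose (bitwise OR) the codes of all keys of `text` into one integer."""
    sig = 0
    for ch in text:  # monogram keys: single characters
        for pos in key_bits(ch, sig_bits, mono_weight):
            sig |= 1 << pos
    if use_bigrams:  # bigram keys: consecutive character pairs
        for i in range(len(text) - 1):
            for pos in key_bits(text[i:i + 2], sig_bits, bi_weight):
                sig |= 1 << pos
    return sig


def matches(doc_sig, query_sig):
    """A document qualifies when every bit set in the query is set in the document."""
    return doc_sig & query_sig == query_sig


query = "李小明"       # a tri-syllabic name
scrambled = "明李小"   # same characters, different order: not a true occurrence

# Monogram-only signatures ignore character order, so the scrambled document
# always qualifies: this is an adjacency false hit.
print(matches(signature(scrambled), signature(query)))

# With bigram keys added, the query also requires the bits of "小明", which the
# scrambled document sets only if other keys happen to collide on them
# (that residual case would be a random false hit).
print(matches(signature(scrambled, use_bigrams=True),
              signature(query, use_bigrams=True)))
```

In this sketch, raising `mono_weight` or `bi_weight` sets more bits per key, which makes each key more selective but fills the signature faster. That is the trade-off behind the weight assignment mentioned in the abstract, although the sketch does not attempt to find the optimum.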
Keywords :
database theory; document image processing; optical character recognition; query processing; visual databases; Chinese names; Chinese signature file; adjacency false hit; bigram; character hashing method; character sequence information; character-based Chinese text retrieval; document signature; irrelevant characters; multicharacter query; multisyllabic query; optimal weight assignment; proper nouns; random false hit; superimposed coding method; trisyllabic query false hits; Hardware; Optimization methods
Conference_Title :
Document Analysis and Recognition, 1995, Proceedings of the Third International Conference on
Conference_Location :
Montreal, Que.
Print_ISBN :
0-8186-7128-9
DOI :
10.1109/ICDAR.1995.598966