DocumentCode :
2104122
Title :
Two Entropy-Based Methods for Detecting Errors in POS-Tagged Treebank
Author :
Nguyen, Phuong-Thai ; Le, Anh-Cuong ; Ho, Tu-Bao ; Do, Thi-Thanh-Tam
Author_Institution :
Univ. of Eng. & Technol., Hanoi, Vietnam
fYear :
2011
fDate :
14-17 Oct. 2011
Firstpage :
150
Lastpage :
156
Abstract :
This paper proposes two methods that use conditional entropy to find errors and inconsistencies in treebank corpora. Both methods rest on two principles: high entropy implies a high likelihood of error, and entropy decreases after error correction. The first method ranks error candidates using a scoring function based on conditional entropy. The second method uses beam search to find a subset of error candidates whose label changes decrease conditional entropy. We carried out experiments on a Vietnamese treebank corpus at two levels of annotation: word segmentation and part-of-speech tagging. The experiments showed that these methods detected high-error-density subsets of the original error candidate sets. These subsets are only one third the size of the full sets, yet they contain 80%-90% of the errors in the full sets. Moreover, entropy was significantly reduced after error correction.
Keywords :
entropy; error correction; error detection; speech processing; trees (mathematics); word processing; POS-tagged treebank; Vietnamese treebank corpus; conditional entropy; entropy-based methods; error correction; error detection; high-error-density subsets; part-of-speech tagging; scoring function; word segmentation; Compounds; Data mining; Educational institutions; Entropy; Error correction; Manuals; Tagging; corpus; entropy; error detection; part of speech (POS) tagging; treebank; word segmentation;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Knowledge and Systems Engineering (KSE), 2011 Third International Conference on
Conference_Location :
Hanoi
Print_ISBN :
978-1-4577-1848-9
Type :
conf
DOI :
10.1109/KSE.2011.30
Filename :
6063458