Conference record number:
3297
Article title:
SAZEH: A Wide Coverage Persian Constituency Tree Bank and Parser
Title in another language:
SAZEH: A Wide Coverage Persian Constituency Tree Bank and Parser
Authors:
Tabatabayi Seifi, Shohreh (RCDAT: Research Center for Development of Advanced Technologies, Speech Group, Tehran, Iran); Sarraf Rezaee, Iman (RCDAT: Research Center for Development of Advanced Technologies, Speech Group, Tehran, Iran)
Keywords:
Natural Language Processing, Constituency Parser, Constituency Treebank
Conference title:
19th International Symposium on Artificial Intelligence and Signal Processing
English abstract:
Constituency parsing is a basic operation in many NLP tasks,
such as translation, information extraction, and abstractive
summarization. Training a probabilistic parser requires a
wide-coverage constituency treebank. SAZEH is the first
large-volume Persian constituency treebank, with more than
21,000 parsed trees and 627,000 tokens. The average sentence
length is 30 words. The sentences are drawn from the Peykare
corpus, which already has POS tags. The Berkeley Lexical Parser
was trained on the SAZEH corpus; the best F-measure attained on
the test portion of the corpus is 81.65%, using gold POS tags.
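
The abstract reports an F-measure but does not say how it was computed. For constituency parsers it is commonly a PARSEVAL-style labeled bracketing score, and the sketch below illustrates that metric under this assumption. The function names (`parse_tree`, `labeled_spans`, `bracket_f1`) and the toy labels are hypothetical and are not taken from SAZEH or its tag set; the sketch also assumes every word sits under a POS preterminal and ignores the punctuation and label-normalization rules that standard scorers apply.

```python
from collections import Counter


def parse_tree(text):
    """Read a bracketed tree such as "(S (NP (N a)) (VP (V b)))" into (label, children) tuples."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read():
        nonlocal pos
        assert tokens[pos] == "(", "expected an opening bracket"
        pos += 1
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(read())
            else:
                children.append(tokens[pos])  # a word under a preterminal
                pos += 1
        pos += 1  # consume ")"
        return (label, children)

    return read()


def labeled_spans(tree):
    """Count (label, start, end) for every constituent above the POS level."""
    spans = Counter()

    def walk(node, start):
        label, children = node
        # Assumption: a preterminal has exactly one child, which is a word.
        if len(children) == 1 and isinstance(children[0], str):
            return start + 1
        end = start
        for child in children:
            end = walk(child, end)
        spans[(label, start, end)] += 1
        return end

    walk(tree, 0)
    return spans


def bracket_f1(gold_text, pred_text):
    """Return (precision, recall, F1) over labeled brackets for one sentence."""
    gold = labeled_spans(parse_tree(gold_text))
    pred = labeled_spans(parse_tree(pred_text))
    matched = sum((gold & pred).values())  # multiset intersection
    precision = matched / sum(pred.values()) if pred else 0.0
    recall = matched / sum(gold.values()) if gold else 0.0
    if precision + recall == 0:
        return 0.0, 0.0, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Toy example with illustrative labels (not the SAZEH tag set).
GOLD = "(S (NP (DT the) (NN cat)) (VP (VBD sat) (PP (IN on) (NP (DT the) (NN mat)))))"
PRED = "(S (NP (DT the) (NN cat)) (VP (VBD sat) (NP (IN on) (NP (DT the) (NN mat)))))"

if __name__ == "__main__":
    print(bracket_f1(GOLD, PRED))
```

A corpus-level score would sum matched, predicted, and gold bracket counts over all test sentences before computing precision and recall, rather than averaging per-sentence values.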