Title :
A novel unsupervised method for named-entity identification in resource-poor languages using bilingual corpus
Author :
Seraj, Ramtin Mehdizadeh ; Jabbari, Fattaneh ; Khadivi, Shahram
Author_Institution :
Dept. of Comput. Eng., Amirkabir Univ. of Technol., Tehran, Iran
Abstract :
We propose a new unsupervised method for identifying Named Entities (NEs) in resource-poor languages. The idea is to transfer knowledge of NEs from a resource-rich language to a resource-poor one by using a bilingual parallel corpus of the language pair. All NE pair candidates are first extracted and then passed through lexical and contextual filters to obtain a high-precision seed set of NEs. A graph is created for each language from these seeds and is used to bootstrap the primary seeds. Based on the output of the graph, a classifier is trained to identify NEs in the resource-poor language. In this paper, Farsi and English are selected as representatives of resource-poor and resource-rich languages, respectively. Because Farsi is not written in the Latin script, we present a new distance function, called M-distance, to compute the edit distance between strings in Latin and Farsi scripts. Finally, we release the first Farsi NE identifier, which uses no Farsi-specific features and achieves an F1 score of 0.74.
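The abstract does not define M-distance, so the following is only a rough sketch of the general idea of comparing a Farsi-script string with a Latin-script one. It assumes a small, hypothetical letter map (FARSI_TO_LATIN), romanizes the Farsi word, and applies ordinary Levenshtein distance normalized by length. The map, the function names, and the normalization are all illustrative assumptions, not the authors' method.

```python
# Illustrative sketch only: this is NOT the paper's M-distance, whose
# definition is not given in the abstract.

# Hypothetical, deliberately incomplete Farsi-to-Latin letter map (assumption).
# Short vowels are usually unwritten in Farsi, so romanized forms lose them.
FARSI_TO_LATIN = {
    "ا": "a", "ب": "b", "پ": "p", "ت": "t", "د": "d", "ر": "r",
    "س": "s", "ش": "sh", "ک": "k", "ل": "l", "م": "m", "ن": "n",
    "و": "o", "ه": "h", "ی": "i",
}


def romanize(farsi_word):
    """Map each Farsi letter to a rough Latin equivalent; unmapped letters are dropped."""
    return "".join(FARSI_TO_LATIN.get(ch, "") for ch in farsi_word)


def levenshtein(a, b):
    """Standard edit distance with unit-cost insert, delete, and substitute."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete from a
                            curr[j - 1] + 1,      # insert into a
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def cross_script_distance(farsi_word, latin_word):
    """Edit distance between the romanized Farsi word and the lowercased
    Latin word, divided by the longer length so the result lies in [0, 1]."""
    romanized = romanize(farsi_word)
    latin = latin_word.lower()
    longest = max(len(romanized), len(latin))
    return levenshtein(romanized, latin) / longest if longest else 0.0
```

A lexical filter of the kind the abstract mentions could, under these assumptions, keep a candidate pair only when cross_script_distance falls below some threshold; the threshold value would have to be tuned and is not given in the source.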
Keywords :
graph theory; information filtering; natural language processing; pattern classification; English; Farsi NE identifier; Farsi script; Latin script; M-distance; NE pair candidate extraction; bilingual parallel corpus; candidate filtering; classifier; contextual filters; distance function; graph; lexical filters; named-entity identification; primary seed bootstrapping; resource-poor languages; resource-rich language; unsupervised method; Computational linguistics; Context; Educational institutions; Electronic mail; Joints; Probability; Training;
Conference_Title :
Telecommunications (IST), 2014 7th International Symposium on
Conference_Location :
Tehran
Print_ISBN :
978-1-4799-5358-5
DOI :
10.1109/ISTEL.2014.7000759