Title of article

ParsNER-Social: A Corpus for Named Entity Recognition in Persian Social Media Texts

Author/Authors

Asgari-Bidhendi, Majid (Computer Engineering School, Iran University of Science and Technology, Tehran, Iran); Janfada, Behrooz (Computer Engineering School, Iran University of Science and Technology, Tehran, Iran); Roshani Talab, Omid Reza (Computer Engineering School, Iran University of Science and Technology, Tehran, Iran); Minaei-Bidgoli, Behrouz (Computer Engineering School, Iran University of Science and Technology, Tehran, Iran)

Pages

From page

181

To page

192

Abstract

Named Entity Recognition (NER) is an essential prerequisite for many natural language processing tasks. All public corpora for Persian named entity recognition, such as ParsNERCorp and ArmanPersoNERCorpus, are based on the Bijankhan corpus, which originates from the Hamshahri newspaper in 2004. Consequently, most published Persian named entity recognition models are tuned specifically for news data and are not flexible enough to be applied to other text categories such as social media texts. In this work, we introduce ParsNER-Social, a corpus for training named entity recognition models in the Persian language, built from social media sources. The corpus consists of 205,373 tokens and their NER tags, crawled from social media content, including 10 Telegram channels in 10 different categories. Furthermore, three supervised methods are trained and evaluated on the ParsNER-Social corpus: two conditional random field models as baselines, and one state-of-the-art deep learning model in six different configurations. The experiments show that the Mono-Lingual Persian models based on Bidirectional Encoder Representations from Transformers (MLBERT) outperform the other approaches on the ParsNER-Social corpus. Among the different configurations of the MLBERT models, the ParsBERT+BERT-TokenClass model obtains an F1-score of 89.65%.
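
The abstract reports model quality as an F1-score over NER tags but does not state the tagging scheme or the scoring procedure. As an illustration only, the following minimal sketch computes an entity-level F1 under two assumptions that are not taken from the paper: tags follow the BIO scheme, and an entity counts as correct only when its span and type both match exactly. The labels `PER` and `LOC` and the function names are hypothetical.

```python
from typing import List, Set, Tuple

# (start index, end index exclusive, entity type)
Span = Tuple[int, int, str]


def extract_spans(tags: List[str]) -> Set[Span]:
    """Collect entity spans from a BIO tag sequence (assumed scheme)."""
    spans: Set[Span] = set()
    start, etype = None, None
    # A trailing "O" sentinel closes an entity that ends the sequence.
    for i, tag in enumerate(tags + ["O"]):
        continues = tag.startswith("I-") and etype == tag[2:]
        if continues:
            continue
        if etype is not None:
            spans.add((start, i, etype))
        if tag.startswith("B-") or tag.startswith("I-"):
            # Lenient choice: a stray "I-" tag opens a new entity.
            start, etype = i, tag[2:]
        else:
            start, etype = None, None
    return spans


def entity_f1(gold: List[str], pred: List[str]) -> float:
    """Entity-level F1 with exact span-and-type matching."""
    gold_spans = extract_spans(gold)
    pred_spans = extract_spans(pred)
    true_pos = len(gold_spans & pred_spans)
    precision = true_pos / len(pred_spans) if pred_spans else 0.0
    recall = true_pos / len(gold_spans) if gold_spans else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical four-token sentence: the prediction misses the location.
gold_tags = ["B-PER", "I-PER", "O", "B-LOC"]
pred_tags = ["B-PER", "I-PER", "O", "O"]
score = entity_f1(gold_tags, pred_tags)
```

In this toy case the prediction recovers one of the two gold entities, so precision is 1.0 and recall is 0.5. A token-level F1 would give a different figure for the same tags, which is why the scoring convention matters when comparing reported numbers.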

Keywords

Named Entity Recognition, Natural Language Processing, Social Media Corpus, Persian Language

Journal title

Journal of Artificial Intelligence and Data Mining

Serial Year

2021

Record number

2685751

Link To Document

https://search.isc.ac/dl/search/defaultta.aspx?DTC=10&DC=2685751