• Title of article

    ParsNER-Social: A Corpus for Named Entity Recognition in Persian Social Media Texts

  • Author/Authors

    Asgari-Bidhendi, Majid (Computer Engineering School, Iran University of Science and Technology, Tehran, Iran)
    Janfada, Behrooz (Computer Engineering School, Iran University of Science and Technology, Tehran, Iran)
    Roshani Talab, Omid Reza (Computer Engineering School, Iran University of Science and Technology, Tehran, Iran)
    Minaei-Bidgoli, Behrouz (Computer Engineering School, Iran University of Science and Technology, Tehran, Iran)

  • Pages
    12
  • From page
    181
  • To page
    192
  • Abstract
    Named Entity Recognition (NER) is one of the essential prerequisites for many natural language processing tasks. All public corpora for Persian named entity recognition, such as ParsNERCorp and ArmanPersoNERCorpus, are based on the Bijankhan corpus, which originated from the Hamshahri newspaper in 2004. Consequently, most published Persian named entity recognition models are tuned specifically for news data and are not flexible enough to be applied to other text categories such as social media texts. In this work, we introduce ParsNER-Social, a corpus built from social media sources for training named entity recognition models in the Persian language. The corpus consists of 205,373 tokens and their NER tags, crawled from social media content, including 10 Telegram channels in 10 different categories. Furthermore, three supervised methods are trained and evaluated on the ParsNER-Social corpus: two conditional random field models as baselines and one state-of-the-art deep learning model with six different configurations. The experiments show that the Mono-Lingual Persian models based on Bidirectional Encoder Representations from Transformers (MLBERT) outperform the other approaches on the ParsNER-Social corpus. Among the different configurations of the MLBERT models, the ParsBERT+BERT-TokenClass model obtained an F1-score of 89.65%.
  • Keywords
    Named Entity Recognition, Natural Language Processing, Social Media Corpus, Persian Language
  • Journal title
    Journal of Artificial Intelligence and Data Mining
  • Serial Year
    2021
  • Record number

    2685751