• DocumentCode
    1627520
  • Title

    Persica: A Persian corpus for multi-purpose text mining and natural language processing

  • Author

    Eghbalzadeh, H. ; Hosseini, Behrooz ; Khadivi, Shahram ; Khodabakhsh, Ali

  • Author_Institution
    Dept. of Comput. Sci. & Eng., Shiraz Univ., Shiraz, Iran
  • fYear
    2012
  • Firstpage
    1207
  • Lastpage
    1214
  • Abstract
    The lack of a multi-application text corpus, despite the surge in text data, is a serious bottleneck in text mining and natural language processing, especially for the Persian language. This paper presents Persica, a new corpus for the analysis of Persian news articles. News analysis includes news classification, topic discovery and classification, trend discovery, category classification, and many other procedures. Dealing with news has special requirements; first of all, it needs a valid, news-content-enriched corpus on which to perform experiments. Our approach is based on modified category classification and data normalization over Persian news articles, which has led to the creation of a multipurpose Persian corpus that shows reasonable results in text mining outcomes. To our knowledge, there are few Persian corpora in the literature, and none of them has the time trend characteristics of Persian news. Empirical results on our benchmark indicate that, in addition to reducing the problem dimensions and useless content, Persica maintains acceptable validity and reliability in comparison with standard corpora in the literature.
  • Keywords
    data mining; natural language processing; text analysis; NEWS articles analysis; NEWS classification; NEWS-content-enriched corpus; Persian NEWS time trend characteristic; Persian corpus; Persian language; Persica; category classification; data normalization; multipurpose text mining; natural language processing; topic classification; topic discovery; trend discovery; Educational institutions; HTML; Market research; Pragmatics; Reliability; Standards; Text mining; Categorization; Data mining; Text Classification; Text Mining; subject and trend detection;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Telecommunications (IST), 2012 Sixth International Symposium on
  • Conference_Location
    Tehran
  • Print_ISBN
    978-1-4673-2072-6
  • Type

    conf

  • DOI
    10.1109/ISTEL.2012.6483172
  • Filename
    6483172