• DocumentCode
    638330
  • Title
    Building Arabic corpora from Wikisource
  • Author
    Bensalem, Imene ; Chikhi, Salim ; Rosso, Paolo
  • Author_Institution
    MISC Lab., Constantine 2 Univ., Constantine, Algeria
  • fYear
    2013
  • fDate
    27-30 May 2013
  • Firstpage
    1
  • Lastpage
    2
  • Abstract
    This paper describes a new tool that helps extract clean text from the Arabic Wikisource dump in order to build corpora. The tool's purpose is illustrated by generating a subcorpus from Wikisource, a step towards building an evaluation corpus for Arabic intrinsic plagiarism detection.
  • Keywords
    Web sites; natural language processing; text analysis; Arabic corpora building; Arabic intrinsic plagiarism detection; Wikisource; clean text extraction; subcorpus; Buildings; Cleaning; Educational institutions; Encyclopedias; Plagiarism; Writing; XML; Arabic Wikisource; intrinsic plagiarism detection; tools for building corpora
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Computer Systems and Applications (AICCSA), 2013 ACS International Conference on
  • Conference_Location
    Ifrane
  • ISSN
    2161-5322
  • Type
    conf
  • DOI
    10.1109/AICCSA.2013.6616474
  • Filename
    6616474