DocumentCode :
638330
Title :
Building Arabic corpora from Wikisource
Author :
Bensalem, Imene ; Chikhi, Salim ; Rosso, Paolo
Author_Institution :
MISC Lab., Constantine 2 Univ., Constantine, Algeria
fYear :
2013
fDate :
27-30 May 2013
Firstpage :
1
Lastpage :
2
Abstract :
This paper describes a new tool that helps extract clean text from the Arabic Wikisource dump in order to build corpora. The tool's purpose is illustrated by the generation of a subcorpus from Wikisource, which is a step towards building an evaluation corpus for Arabic intrinsic plagiarism detection.
Keywords :
Web sites; natural language processing; text analysis; Arabic corpora building; Arabic intrinsic plagiarism detection; Wikisource; clean text extraction; subcorpus; Buildings; Cleaning; Educational institutions; Encyclopedias; Plagiarism; Writing; XML; Arabic Wikisource; intrinsic plagiarism detection; tools for building corpora;
fLanguage :
English
Publisher :
IEEE
Conference_Title :
Computer Systems and Applications (AICCSA), 2013 ACS International Conference on
Conference_Location :
Ifrane
ISSN :
2161-5322
Type :
conf
DOI :
10.1109/AICCSA.2013.6616474
Filename :
6616474