DocumentCode :
2654271
Title :
Towards acquisition of a thematic Persian corpus from the Tebyan Portal: TebCorp
Author :
Khalifehsoltani, Sayed Nasir ; Cholmaghani, Ali ; Vahdani, Ali ; Moallemi, Reza
Author_Institution :
Dept. of Comput. Eng., SheikhBahee Univ., Esfahan, Iran
Volume :
7
fYear :
2010
fDate :
16-18 April 2010
Abstract :
TebCorp is a large thematic collection of modern Persian text, consisting of 500 MB of text from the Tebyan Portal. It contains more than 93,000 articles on 1,097 topics, with more than 44 million total words and about 550,000 distinct words, which makes it suitable for information retrieval research. In this paper, we exploit the Tebyan Portal, which contains a vast number of prominent Persian articles, as a linguistic resource for building a multipurpose thematic corpus for Persian. We present details of how the corpus was built, including information retrieval and collection assessment, and conclude with practical information about the corpus.
Keywords :
data mining; information retrieval; natural language processing; Persian articles; TebCorp; Tebyan portal; Web mining; collection assessment; linguistic resource; text collection; thematic Persian corpus; Buildings; Cultural differences; Dictionaries; Distributed databases; Natural languages; Portals; Testing; Information Extraction; Linguistic Resources; Persian Corpora
fLanguage :
English
Publisher :
IEEE
Conference_Title :
Computer Engineering and Technology (ICCET), 2010 2nd International Conference on
Conference_Location :
Chengdu
Print_ISBN :
978-1-4244-6347-3
Type :
conf
DOI :
10.1109/ICCET.2010.5485685
Filename :
5485685