DocumentCode :
3698655
Title :
Text analysis case study: Determining word Frequency based on Azerbaijan top 500 websites.
Author :
Abzetdin Z. Adamov
Author_Institution :
Applied Research Center for Data Analytics and Web Insights (CeDAWI), Qafqaz University, Baku, Azerbaijan
fYear :
2015
Firstpage :
76
Lastpage :
79
Abstract :
Word Frequency Distribution (WFD) is one of the most important sub-areas of Natural Language Processing (NLP) and Computational Linguistics. The reliability and quality of WFD results depend heavily on the size and quality of the corpora. This paper describes an ongoing project that aims to build AzWebCorpus, a corpus of Azerbaijani text. The top 500 websites in Azerbaijan are used as the text source for corpus building. Most of the essential tools, including a Web Crawler, Text Cleaner, and Tokenizer, have been developed, and several open-source tools have been used. Moreover, AzWebCorpus is compared, in terms of word frequency, with another corpus, AzBookCorpus, built from text taken from electronic books. The same approach used in this paper is applicable to other languages.
Keywords :
"HTML","Fires","Poles and towers","Frequency conversion","Syntactics"
Publisher :
IEEE
Conference_Titel :
Application of Information and Communication Technologies (AICT), 2015 9th International Conference on
Print_ISBN :
978-1-4673-6855-1
Type :
conf
DOI :
10.1109/ICAICT.2015.7338521
Filename :
7338521