Title :
Text analysis case study: Determining word Frequency based on Azerbaijan top 500 websites.
Author :
Abzetdin Z. Adamov
Author_Institution :
Applied Research Center for Data Analytics and Web Insights (CeDAWI), Qafqaz University, Baku, Azerbaijan
Abstract :
Word Frequency Distribution (WFD) is one of the most important sub-areas of Natural Language Processing (NLP) and Computational Linguistics. The reliability and quality of WFD results depend heavily on the size and quality of the corpus. This paper describes an ongoing project that aims to build AzWebCorpus, a corpus of Azerbaijani text. The top 500 websites in Azerbaijan are used as the text source for building the corpus. Most of the essential tools, including a Web Crawler, Text Cleaner, and Tokenizer, have been developed, and several open-source tools have been used. Moreover, AzWebCorpus is compared in terms of word frequency with another corpus, AzBookCorpus, which is built from text taken from electronic books. The approach used in this paper is applicable to other languages.
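The clean, tokenize, and count stages named in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the names TextExtractor, clean_html, az_lower, tokenize, and word_frequencies are hypothetical, the crawler stage is omitted (pages are assumed to be already-downloaded HTML strings), and the tokenizer simply treats every run of Unicode letters as a word. The one language-specific assumption is the Azerbaijani casing rule, where "I" lowercases to "ı" and "İ" to "i".

```python
import re
from collections import Counter
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text nodes, skipping the bodies of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.parts.append(data)


def clean_html(html):
    """Hypothetical text cleaner: strip markup and return plain text."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


# Assumed Azerbaijani casing rule: dotless I -> ı, dotted İ -> i.
# Applied before str.lower(), which would otherwise map "I" to "i".
_AZ_LOWER = str.maketrans({"I": "ı", "İ": "i"})


def az_lower(text):
    return text.translate(_AZ_LOWER).lower()


# A token is any run of Unicode letters (no digits, no underscore).
TOKEN_RE = re.compile(r"[^\W\d_]+")


def tokenize(text):
    """Hypothetical tokenizer: lowercase, then extract letter runs."""
    return TOKEN_RE.findall(az_lower(text))


def word_frequencies(pages):
    """Count token occurrences across an iterable of HTML strings."""
    counts = Counter()
    for html in pages:
        counts.update(tokenize(clean_html(html)))
    return counts


if __name__ == "__main__":
    sample_pages = [
        "<html><body><p>Bakı və Gəncə</p><script>var x = 1;</script></body></html>",
        "<p>BAKI İŞIQ</p>",
    ]
    for word, freq in word_frequencies(sample_pages).most_common():
        print(word, freq)
```

Comparing two corpora, such as a web corpus and a book corpus, would then amount to building one Counter per corpus and comparing relative frequencies (count divided by total tokens) for the same words.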
Keywords :
"HTML","Fires","Poles and towers","Frequency conversion","Syntactics"
Conference_Title :
Application of Information and Communication Technologies (AICT), 2015 9th International Conference on
Print_ISBN :
978-1-4673-6855-1
DOI :
10.1109/ICAICT.2015.7338521