DocumentCode
1587409
Title
Language Identifications of Arabic Script Web Documents Using Independent Component Analysis
Author
Selamat, Ali ; Lee, Zhi-Sam
Author_Institution
Fac. of Comput. Sci. & Inf. Syst., Univ. Teknol. Malaysia, Skudai
fYear
2008
Firstpage
427
Lastpage
432
Abstract
We analyze language identification algorithms for Arabic script Web documents written in languages such as Arabic, Jawi, Persian and Urdu, using independent component analysis (ICA). We combine an entropy term weighting scheme with class profile based feature (CPBF) vectors as feature selection methods to select the best features of Arabic script Web documents for Web page language identification. The selected features are then used as input, based on the identification of the latent semantics of user profiles using singular value decomposition (SVD). The SVD removes noise from the retrieved documents before the ICA is applied for topic extraction. We assume that the topics of the documents are independent of one another. We use the information retrieval measures precision, recall and the F measure to evaluate the effectiveness of the proposed algorithm. Our experiments show that the proposed method can lead to good Arabic script language identification results, with the ICA giving good separation of the Arabic, Persian, and Urdu languages.
Keywords
Internet; computational linguistics; document handling; independent component analysis; information retrieval; natural language processing; singular value decomposition; Arabic script Web document; ICA; SVD; Web page language identification; class based feature vector; document retrieval; entropy term weighting scheme; feature selection; independent component analysis; information retrieval; latent semantics; noise removal; singular value decomposition; topic extraction; Algorithm design and analysis; Asia; Data mining; Entropy; Independent component analysis; Information retrieval; Machine learning algorithms; Natural languages; Singular value decomposition; Web pages; ICA; class profile based features; language identifications; web documents;
fLanguage
English
Publisher
IEEE
Conference_Titel
Modeling & Simulation, 2008. AICMS 08. Second Asia International Conference on
Conference_Location
Kuala Lumpur
Print_ISBN
978-0-7695-3136-6
Electronic_ISBN
978-0-7695-3136-6
Type
conf
DOI
10.1109/AMS.2008.46
Filename
4530514