• DocumentCode
    2169960
  • Title
    Identifying the Dominant Language of Web Page Using Supervised N-grams
  • Author
    Choon-Ching Ng ; Siau-Chuin Liew ; Hussin, W.M.S.W. ; Herawan, Tutut
  • Author_Institution
    Fac. of Comput. Syst. & Software Eng., Univ. Malaysia Pahang, Pekan, Malaysia
  • fYear
    2012
  • fDate
    26-28 Nov. 2012
  • Firstpage
    344
  • Lastpage
    348
  • Abstract
    Natural language processing is an emerging technology in the linguistic industry and an aid to human-computer interaction in computer science. Language identification is a form of pattern recognition that helps to identify the predefined language of a web page and to predict the unknown language of a given text. Written texts are built from common features such as characters, words, and n-grams, and these features are distinctive for each language. The experimental results show that the supervised n-gram approach identifies languages accurately and outperforms the distance measurement on Arabic script web pages.
  • Keywords
    Web sites; natural language processing; support vector machines; text analysis; Arabic script Web page; Web page dominant language identification; computer science; distance measurement; human-computer interaction; linguistic industry; pattern recognition; supervised N-grams; text language; written text; Arabic script; language identification
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Title
    Advanced Computer Science Applications and Technologies (ACSAT), 2012 International Conference on
  • Conference_Location
    Kuala Lumpur
  • Print_ISBN
    978-1-4673-5832-3
  • Type
    conf
  • DOI
    10.1109/ACSAT.2012.74
  • Filename
    6516378