• DocumentCode
    476745
  • Title

    Unigram language identifications using adaptive neural network

  • Author

    Selamat, Ali ; Ng, Choon-Ching

  • Author_Institution
    Faculty of Computer Science and Information Systems, Universiti Teknologi Malaysia, 81310, Johor Bahru, Malaysia
  • Volume
    2
  • fYear
    2008
  • fDate
    26-28 Aug. 2008
  • Firstpage
    1
  • Lastpage
    5
  • Abstract
    In general, a web document may contain several scripts, and each script can be used to write different languages. Determining the languages of a document is required in order to apply many search and information retrieval techniques effectively. In this work, we propose a hybrid-grams feature selection method that integrates unigrams and bigrams. The method uses local statistical information within a document to determine its language. In our experiments, hybrid-grams outperformed unigrams and bigrams in Cyrillic and Indic script language identification.
  • Keywords
    Adaptive systems; Computer science; Data mining; Electronic mail; Encoding; Feature extraction; Information retrieval; Internet; Natural languages; Statistics;
  • fLanguage
    English
  • Publisher
    ieee
  • Conference_Titel
    Information Technology, 2008. ITSim 2008. International Symposium on
  • Conference_Location
    Kuala Lumpur, Malaysia
  • Print_ISBN
    978-1-4244-2327-9
  • Electronic_ISBN
    978-1-4244-2328-6
  • Type

    conf

  • DOI
    10.1109/ITSIM.2008.4631694
  • Filename
    4631694