• DocumentCode
    479081
  • Title
    Dictionary-Based Bilingual Web Page Classification
  • Author
    Liu, Jicheng ; Liang, Chunyan ; Qi, Jianxun
  • Author_Institution
    North China Electr. Power Univ., Beijing
  • fYear
    2008
  • fDate
    12-14 Oct. 2008
  • Firstpage
    1
  • Lastpage
    4
  • Abstract
    Web page classification poses new research challenges because web pages are noisy. For bilingual Chinese-English web pages, an additional problem is how to extract the terms of each language accurately. This paper proposes a new dictionary-based multilingual text categorization approach that aims to classify domain-specific Chinese-English web pages into a hierarchical topic structure more accurately. The approach recognizes and integrates web page encodings using an automatic encoding detection and integration method, which makes feature extraction more precise for multilingual pages. It also strengthens the domain concepts in the web pages by means of a domain dictionary. Experimental results show that the proposed approach performs better than the traditional classification method when classifying bilingual web pages.
  • Keywords
    Web sites; classification; dictionaries; natural language processing; text analysis; automatic encoding detection; bilingual Chinese-English Web pages; dictionary-based bilingual Web page classification; dictionary-based multilingual text categorization; domain concepts; domain dictionary; feature extraction; hierarchical topic structure; integration method; multilingual pages; Character recognition; Classification tree analysis; Data mining; Dictionaries; Encoding; Feature extraction; Internet; Natural languages; Text categorization; Web pages;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Wireless Communications, Networking and Mobile Computing, 2008. WiCOM '08. 4th International Conference on
  • Conference_Location
    Dalian
  • Print_ISBN
    978-1-4244-2107-7
  • Electronic_ISBN
    978-1-4244-2108-4
  • Type
    conf
  • DOI
    10.1109/WiCom.2008.2684
  • Filename
    4680873
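
The abstract names two steps: detecting and unifying page encodings, then emphasizing domain concepts with a domain dictionary. The sketch below is only an illustration of that idea, not the authors' method. The candidate encoding list, the function names `decode_page` and `weighted_terms`, the `boost` factor, and the substring matching of Chinese terms are all assumptions introduced here; the record does not describe how the paper implements either step.

```python
import re
from collections import Counter

# Hypothetical candidate encodings, tried in order (the paper's list is not given).
CANDIDATE_ENCODINGS = ("utf-8", "gb18030", "big5")


def decode_page(raw: bytes) -> str:
    """Decode page bytes into one common representation (a Python str).

    Assumption: the first candidate encoding that decodes without error wins.
    """
    for enc in CANDIDATE_ENCODINGS:
        try:
            return raw.decode(enc)
        except UnicodeDecodeError:
            continue
    # Fall back to lossy decoding so that a noisy page still yields text.
    return raw.decode("utf-8", errors="replace")


def weighted_terms(text: str, domain_dict: set, boost: float = 2.0) -> dict:
    """Count terms and multiply the counts of domain-dictionary terms by `boost`.

    English terms are lowercased ASCII words. Chinese terms are found by
    substring matching against the dictionary (an assumed simplification).
    """
    counts = Counter(w.lower() for w in re.findall(r"[A-Za-z]+", text))
    for term in domain_dict:
        if not term.isascii():
            n = text.count(term)
            if n:
                counts[term] = n
    return {t: c * (boost if t in domain_dict else 1.0) for t, c in counts.items()}


if __name__ == "__main__":
    raw = "电力 power grid 电力 news".encode("gbk")
    print(weighted_terms(decode_page(raw), {"电力", "power"}))
```

The resulting weights could feed any standard text classifier as features; the choice of classifier and of the hierarchical topic structure is outside what this record specifies.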