• DocumentCode
    174897
  • Title
    A Pure URL-Based Genre Classification of Web Pages
  • Author
    Jebari, Chaker
  • Author_Institution
    Inf. Technol. Dept., Ibri Coll. of Appl. Sci., Ibri, Oman
  • fYear
    2014
  • fDate
    1-5 Sept. 2014
  • Firstpage
    233
  • Lastpage
    237
  • Abstract
    In this paper, we propose a new approach to multi-label genre classification of web pages that exploits character n-grams extracted from the URL of a web page rather than from its content. Using only the URL reduces feature-extraction time because the content of the web page does not need to be downloaded. The approach handles the complexity of web pages through multi-label classification, in which each web page can be assigned to more than one genre. It also implements a new weighting technique that exploits the structure of the URL. Experiments conducted on a known multi-label dataset show that the approach achieves encouraging results.
  • Keywords
    Internet; computational complexity; pattern classification; Web pages; character n-gram extraction; feature extraction; multilabel genre classification; pure URL-based genre classification; weighting technique; Classification algorithms; Feature extraction; HTML; Measurement; Search engines; Uniform resource locators; Web pages; URL structure; genre; multi-label classification; web page;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Database and Expert Systems Applications (DEXA), 2014 25th International Workshop on
  • Conference_Location
    Munich
  • ISSN
    1529-4188
  • Print_ISBN
    978-1-4799-5721-7
  • Type
    conf
  • DOI
    10.1109/DEXA.2014.56
  • Filename
    6974855
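
The abstract describes extracting character n-grams from a URL and weighting them by URL structure, but this record does not give the details. The following is only a minimal sketch of that general idea, not the author's method: the function names, the choice of n = 3, the split into host, path, and query, and the weight values in SEGMENT_WEIGHTS are all assumptions made for illustration.

```python
from collections import Counter
from urllib.parse import urlsplit

# Hypothetical per-segment weights; the abstract does not state actual values.
SEGMENT_WEIGHTS = {"host": 3.0, "path": 2.0, "query": 1.0}


def char_ngrams(text, n):
    """Return all overlapping character n-grams of the lowercased text."""
    text = text.lower()
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def url_features(url, n=3, weights=SEGMENT_WEIGHTS):
    """Build a weighted n-gram feature vector from the URL alone.

    Each n-gram adds the weight of the URL segment it came from
    (an assumed scheme, chosen only to illustrate structure-based weighting).
    """
    parts = urlsplit(url)
    segments = {"host": parts.netloc, "path": parts.path, "query": parts.query}
    features = Counter()
    for name, segment in segments.items():
        for gram in char_ngrams(segment, n):
            features[gram] += weights[name]
    return features
```

Calling url_features("http://blog.example.com/faq?id=1") yields a sparse mapping from n-grams to weights without fetching the page; such a vector could then be passed to any multi-label classifier.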