• DocumentCode
    1659632
  • Title
    Babouk: Focused Web Crawling for Corpus Compilation and Automatic Terminology Extraction
  • Author
    De Groc, Clément
  • Author_Institution
    Syllabs, Univ. Paris Sud, Paris, France
  • Volume
    1
  • fYear
    2011
  • Firstpage
    497
  • Lastpage
    498
  • Abstract
    The use of the World Wide Web as a free source for large linguistic resources is a well-established idea. Such resources are keystones to domains such as lexicon-based categorization, information retrieval, machine translation and information extraction. In this paper, we present an industrial focused web crawler for the automatic compilation of specialized corpora from the web. This application, created within the framework of the TTC project, is used daily by several linguists to bootstrap large thematic corpora which are then used to automatically generate bilingual terminologies.
  • Keywords
    Internet; feature extraction; information retrieval; language translation; Babouk; TTC project; Web crawling; World Wide Web; automatic compilation; automatic terminology extraction; corpus compilation; information extraction; lexicon-based categorization; linguistic resource; machine translation; Computer networks; Crawlers; HTML; Pragmatics; Terminology; Upper bound; Web sites; focused crawling; resources bootstrapping; web-as-corpus
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Web Intelligence and Intelligent Agent Technology (WI-IAT), 2011 IEEE/WIC/ACM International Conference on
  • Conference_Location
    Lyon
  • Print_ISBN
    978-1-4577-1373-6
  • Electronic_ISBN
    978-0-7695-4513-4
  • Type
    conf
  • DOI
    10.1109/WI-IAT.2011.253
  • Filename
    6040719