• DocumentCode
    81709
  • Title
    Removing DUST Using Multiple Alignment of Sequences
  • Author
    Rodrigues, Kaio ; Cristo, Marco ; de Moura, Edleno S. ; da Silva, Altigran
  • Author_Institution
    Inst. of Comput. Sci., Fed. Univ. of Amazonas, Manaus, Brazil
  • Volume
    27
  • Issue
    8
  • fYear
    2015
  • fDate
    Aug. 1 2015
  • Firstpage
    2261
  • Lastpage
    2274
  • Abstract
    A large number of URLs collected by web crawlers correspond to pages with duplicate or near-duplicate contents. To crawl, store, and use such duplicated data implies a waste of resources, the building of low quality rankings, and poor user experiences. To deal with this problem, several methods have been proposed to detect and remove duplicate documents without fetching their contents. To accomplish this, these methods learn normalization rules to transform all duplicate URLs into the same canonical form. A challenging aspect of this strategy is deriving a set of general and precise rules. In this work, we present DUSTER, a new approach to derive quality rules that take advantage of a multi-sequence alignment strategy. We demonstrate that a full multi-sequence alignment of URLs with duplicated content, before the generation of the rules, can lead to the deployment of very effective rules. By evaluating our method, we observed it achieved larger reductions in the number of duplicate URLs than our best baseline, with gains of 82 and 140.74 percent in two different web collections.
  • Keywords
    Internet; data mining; information retrieval; DUSTER; URL; Web collections; Web crawlers; content fetching; duplicate document removal; multiple alignment; multisequence alignment strategy; near-duplicate contents; ranking quality; Algorithm design and analysis; Crawlers; Noise; Search engines; Training; Transforms; Uniform resource locators; Web technology; web crawling and normalization rules
  • fLanguage
    English
  • Journal_Title
    Knowledge and Data Engineering, IEEE Transactions on
  • Publisher
    IEEE
  • ISSN
    1041-4347
  • Type
    jour
  • DOI
    10.1109/TKDE.2015.2407354
  • Filename
    7050346