DocumentCode
81709
Title
Removing DUST Using Multiple Alignment of Sequences
Author
Rodrigues, Kaio; Cristo, Marco; de Moura, Edleno S.; da Silva, Altigran
Author_Institution
Inst. of Comput. Sci., Fed. Univ. of Amazonas, Manaus, Brazil
Volume
27
Issue
8
fYear
2015
fDate
Aug. 1, 2015
Firstpage
2261
Lastpage
2274
Abstract
A large number of URLs collected by web crawlers correspond to pages with duplicate or near-duplicate content. Crawling, storing, and using such duplicated data wastes resources, degrades ranking quality, and leads to poor user experiences. To deal with this problem, several methods have been proposed to detect and remove duplicate documents without fetching their contents. To accomplish this, these methods learn normalization rules that transform all duplicate URLs into the same canonical form. A challenging aspect of this strategy is deriving a set of rules that is both general and precise. In this work, we present DUSTER, a new approach to deriving quality rules that takes advantage of a multi-sequence alignment strategy. We demonstrate that a full multi-sequence alignment of URLs with duplicated content, performed before rule generation, can lead to very effective rules. In our evaluation, the method achieved larger reductions in the number of duplicate URLs than our best baseline, with gains of 82 and 140.74 percent in two different web collections.
Keywords
Internet; data mining; information retrieval; DUSTER; URL; Web collections; Web crawlers; content fetching; duplicate document removal; multiple alignment; multisequence alignment strategy; near-duplicate contents; ranking quality; Algorithm design and analysis; Crawlers; Noise; Search engines; Training; Transforms; Uniform resource locators; Web technology; web crawling and normalization rules
fLanguage
English
Journal_Title
IEEE Transactions on Knowledge and Data Engineering
Publisher
IEEE
ISSN
1041-4347
Type
jour
DOI
10.1109/TKDE.2015.2407354
Filename
7050346