Title :
Near Duplicate Web Page Detection with Analytic Feature Weighting
Author :
Naseem, Rashid ; Anees, Sanya ; Muneer, K. ; Farook, K. Syed
Author_Institution :
Dept. of Comput. Sci. & Eng., KMEA Eng. Coll., Aluva, India
Abstract :
Near-duplicate web pages are web pages that differ only slightly in content. They arise from exact replicas of an original site, mirrored sites, versioned sites, multiple representations of the same physical object, and plagiarized documents. Identifying similar or near-duplicate pages in a large collection is a significant problem with widespread applications. We propose a Term Document Weight (TDW) matrix based algorithm that finds near duplicates of an input web page in a huge repository in four phases: preprocessing, feature weighting, filtering, and verification. In the first phase, the system receives an input web page and a similarity threshold and preprocesses the page. In the second phase, feature weights are calculated using Analytic Combination Criteria (ACC). In the third phase, prefix filtering and positional filtering are performed to reduce the number of candidate records. In the verification phase, the system calculates the similarity of the remaining candidates using the Minimum Weight Overlapping (MWO) method and returns an optimal set of near-duplicate web pages.
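The filter-then-verify structure described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' method: the abstract does not define ACC or MWO, so the sketch substitutes plain Jaccard similarity on token sets for both the weighting and the verification step, and it omits positional filtering. The function names, the stopword list, and the document-frequency token ordering are all assumptions made for illustration.

```python
import math
import re
from collections import Counter

# Illustrative stopword list (an assumption, not taken from the paper).
STOPWORDS = {"a", "an", "the", "of", "and", "is", "in", "to"}


def tokenize(text):
    """Preprocessing: lowercase, split on non-alphanumerics, drop stopwords."""
    return {t for t in re.findall(r"[a-z0-9]+", text.lower()) if t not in STOPWORDS}


def jaccard(a, b):
    """Plain Jaccard similarity, used here as a stand-in for MWO."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def prefix(tokens, order, threshold):
    """Prefix of a token set under a global order (rarest tokens first).

    For Jaccard threshold t, two sets with similarity >= t must share at
    least one token among their first |x| - ceil(t * |x|) + 1 tokens.
    """
    ranked = sorted(tokens, key=lambda t: (order.get(t, 0), t))
    keep = len(ranked) - math.ceil(threshold * len(ranked)) + 1
    return set(ranked[:keep])


def near_duplicates(query, repository, threshold):
    """Return (name, similarity) pairs for pages similar to the query."""
    docs = {name: tokenize(text) for name, text in repository.items()}
    q = tokenize(query)
    # Document frequency over repository and query gives the global order.
    df = Counter(t for toks in list(docs.values()) + [q] for t in toks)
    q_prefix = prefix(q, df, threshold)
    results = []
    for name, toks in docs.items():
        # Filtering: skip pages whose prefix shares no token with the query's.
        if q_prefix & prefix(toks, df, threshold):
            # Verification: compute the exact similarity of the candidate.
            sim = jaccard(q, toks)
            if sim >= threshold:
                results.append((name, sim))
    return sorted(results, key=lambda r: -r[1])
```

The prefix test only discards pages that cannot reach the threshold, so the verification step runs on a smaller candidate set without changing the final result. A weighted scheme such as the one the abstract describes would need a weight-aware prefix bound in place of the count-based one used here.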
Keywords :
Web sites; document handling; formal verification; information filtering; ACC; MWO method; TDW matrix based algorithm; analytic combination criteria; analytic feature weighting; duplicate pages; four stage algorithm; huge repository; input Web page; minimum weight overlapping method; near duplicate Web page detection; plagiarized documents; positional filtering; similarity threshold; term document weight matrix based algorithm; verification phase; Algorithm design and analysis; Filtering; Google; Standards; Upper bound; Web pages; Analytic Combination Criteria; Minimum Weight Overlapping; Near Duplicate Detection; Positional filtering; Prefix filtering; Term Document Weight Matrix; Web page classification;
Conference_Title :
Advances in Computing and Communications (ICACC), 2013 Third International Conference on
Conference_Location :
Cochin
DOI :
10.1109/ICACC.2013.69