• DocumentCode
    655400
  • Title

    Near Duplicate Web Page Detection with Analytic Feature Weighting

  • Author

    Naseem, Rashid ; Anees, Sanya ; Muneer, K. ; Farook, K. Syed

  • Author_Institution
    Dept. of Comput. Sci. & Eng., KMEA Eng. Coll., Aluva, India
  • fYear
    2013
  • fDate
    29-31 Aug. 2013
  • Firstpage
    324
  • Lastpage
    327
  • Abstract
    Near duplicate web pages are web pages that differ only slightly in content. They arise from exact replicas of an original site, mirrored sites, versioned sites, multiple representations of the same physical object, and plagiarized documents. Identifying similar or near duplicate pages in a large collection is a significant problem with widespread applications. Here we propose a four-stage algorithm, based on a Term Document Weight (TDW) matrix, for finding near duplicates of an input web page in a huge repository. Its four phases are Preprocessing, Feature weighting, Filtering, and Verification. In the first phase, the system receives an input web page and a similarity threshold and performs preprocessing operations on the page. In the second phase, feature weights are calculated using Analytic Combination Criteria (ACC). In the third phase, Prefix and Positional filtering are performed to reduce the number of candidate records. In the Verification phase, the system calculates the similarity of the remaining candidates using the Minimum Weight Overlapping (MWO) method and returns an optimal set of near duplicate web pages.
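The four phases named in the abstract can be illustrated with a minimal sketch. The abstract does not define ACC, MWO, or the TDW matrix layout, so this sketch does not reproduce them: an IDF-like weight stands in for ACC, a weighted Jaccard score stands in for MWO, and only prefix filtering (not positional filtering) is shown. All function names, the stop-word list, and the weighting formula are assumptions made for illustration.

```python
import math
import re
from collections import defaultdict

# Assumed stop-word list; the paper's preprocessing steps are not specified.
STOP_WORDS = {"a", "an", "the", "of", "and", "is", "in", "to"}


def preprocess(text):
    """Phase 1 (assumed form): lowercase, tokenize, drop stop words."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return {tok for tok in tokens if tok not in STOP_WORDS}


def build_index(repo):
    """Inverted index: token -> set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, tokens in repo.items():
        for tok in tokens:
            index[tok].add(doc_id)
    return index


def make_weight(index, n_docs):
    """Phase 2 stand-in: IDF-like weight (NOT the paper's ACC)."""
    def weight(tok):
        doc_freq = len(index.get(tok, ()))
        return math.log(1.0 + n_docs / (1.0 + doc_freq))
    return weight


def prefix_tokens(tokens, weight, threshold):
    """Phase 3: weighted prefix filtering.

    Tokens are ordered heaviest first. The prefix is extended until the
    weight left in the suffix drops below threshold * total weight, so
    any document meeting the threshold must share a prefix token.
    """
    ordered = sorted(tokens, key=lambda t: (-weight(t), t))
    remaining = sum(weight(t) for t in ordered)
    required = threshold * remaining
    prefix = []
    for tok in ordered:
        if remaining < required:
            break
        prefix.append(tok)
        remaining -= weight(tok)
    return prefix


def weighted_jaccard(a, b, weight):
    """Phase 4 stand-in: weighted Jaccard (NOT the paper's MWO)."""
    union = sum(weight(t) for t in a | b)
    if union == 0:
        return 0.0
    return sum(weight(t) for t in a & b) / union


def near_duplicates(query_text, repo_texts, threshold):
    """Run all four phases; return ids of documents scoring >= threshold."""
    repo = {doc_id: preprocess(text) for doc_id, text in repo_texts.items()}
    index = build_index(repo)
    weight = make_weight(index, len(repo))
    query = preprocess(query_text)
    candidates = set()
    for tok in prefix_tokens(query, weight, threshold):
        candidates |= index.get(tok, set())
    return sorted(
        doc_id for doc_id in candidates
        if weighted_jaccard(query, repo[doc_id], weight) >= threshold
    )


repo_texts = {
    "a": "near duplicate web page detection with feature weighting",
    "b": "near duplicate web page detection using feature weighting",
    "c": "recipes for cooking pasta at home",
}
query_text = "near duplicate web page detection with feature weighting"
strict_matches = near_duplicates(query_text, repo_texts, 0.8)
loose_matches = near_duplicates(query_text, repo_texts, 0.7)
```

The filtering step only looks up the inverted index for the query's prefix tokens, so documents that share none of the heaviest query tokens are never scored. Raising the threshold shortens the prefix and shrinks the candidate set.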
  • Keywords
    Web sites; document handling; formal verification; information filtering; ACC; MWO method; TDW matrix based algorithm; analytic combination criteria; analytic feature weighting; duplicate pages; four stage algorithm; huge repository; input Web page; minimum weight overlapping method; near duplicate Web page detection; plagiarized documents; positional filtering; similarity threshold; term document weight matrix based algorithm; verification phase; Algorithm design and analysis; Filtering; Google; Standards; Upper bound; Web pages; Analytic Combination Criteria; Minimum Weight Overlapping; Near Duplicate Detection; Positional filtering; Prefix filtering; Term Document Weight Matrix; Web page classification;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Advances in Computing and Communications (ICACC), 2013 Third International Conference on
  • Conference_Location
    Cochin
  • Type

    conf

  • DOI
    10.1109/ICACC.2013.69
  • Filename
    6686399