Regional Information Center for Science and Technology - Near-duplicate web page detection by enhanced TDW and simHash technique

Abstract:

The Internet is one of the most significant developments in communication and information retrieval. The massive growth of the web has led to millions of web pages being hosted on heterogeneous platforms. Because there is no standard mechanism to verify that a web page does not already exist before it is hosted on a server, the number of near-duplicate pages on the Internet keeps increasing. Such near-duplicate content may be created intentionally or accidentally. Finding near-duplicate web pages has been a subject of research in the database and web-search communities for a few years. Since most successful content mining strategies adopt term-based approaches, they all suffer from the problem of word synonymy and from a substantial number of comparisons. In this paper we propose a method for detecting duplicate and near-duplicate web pages using an extended term document weighting (TDW) scheme, sentence-level features, and the simHash technique. The existence of duplicate and near-duplicate web pages causes problems that range from wasted network bandwidth and increased storage cost to reduced search engine performance through indexing of duplicated content and increased load on remote hosts.
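To make the simHash step mentioned in the abstract more concrete, the following is a minimal sketch of a generic SimHash fingerprint with a Hamming-distance comparison. It is not the paper's method: the enhanced TDW scheme and the sentence-level features are not specified in the abstract, so raw term frequency is used here as a stand-in weight. The function names (`tokenize`, `simhash`, `hamming_distance`), the 64-bit fingerprint size, and the use of MD5 as the per-term hash are all assumptions made for illustration.

```python
import hashlib
import re
from collections import Counter

FINGERPRINT_BITS = 64  # assumed fingerprint width; not taken from the paper


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def simhash(text):
    """Return a 64-bit SimHash fingerprint of the text.

    Each term votes on every bit position with its weight. Raw term
    frequency is a placeholder for the paper's TDW weights.
    """
    weights = Counter(tokenize(text))
    totals = [0] * FINGERPRINT_BITS
    for term, weight in weights.items():
        digest = hashlib.md5(term.encode("utf-8")).digest()
        term_hash = int.from_bytes(digest[:8], "big")  # first 64 bits
        for bit in range(FINGERPRINT_BITS):
            if (term_hash >> bit) & 1:
                totals[bit] += weight
            else:
                totals[bit] -= weight
    fingerprint = 0
    for bit in range(FINGERPRINT_BITS):
        if totals[bit] > 0:
            fingerprint |= 1 << bit
    return fingerprint


def hamming_distance(a, b):
    """Count the bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


if __name__ == "__main__":
    page_a = "Near duplicate web pages waste storage and bandwidth."
    page_b = "Near duplicate web pages waste bandwidth and storage!"
    # Same terms in a different order give the same fingerprint.
    print(hamming_distance(simhash(page_a), simhash(page_b)))
```

Under this sketch, two pages would be flagged as near duplicates when the Hamming distance between their fingerprints falls below a chosen threshold. The threshold is a tuning parameter that the abstract does not state. Comparing short fingerprints instead of full term vectors is one way to reduce the large number of comparisons that term-based methods require.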