Title :
Performance Optimization of Focused Web Crawling Using Content Block Segmentation
Author :
Ganguly, Bishwaroop ; Raich, Devashri
Author_Institution :
Dept. of C.S.E, RCERT, Chandrapur, India
Abstract :
The World Wide Web (WWW) is a collection of billions of documents formatted using HTML. Web search engines are used to find desired information on the Web. When a user submits a query, the search is performed against the search engine's database. The repository of a search engine is not large enough to accommodate every page available on the Web, so it is desirable to store only the most relevant pages in the database, and selecting those pages requires a better crawling approach. The software that traverses the Web to fetch relevant pages is called a "crawler" or "spider". A specialized crawler, called a focused crawler, traverses the Web and selects the pages relevant to a defined topic rather than exploring all regions of the Web. The crawler does not collect all Web pages; it retrieves only the relevant ones. The major problem is therefore how to retrieve relevant, high-quality Web pages. To address this problem, this paper designs and implements an algorithm that partitions Web pages into content blocks on the basis of headings and then calculates the relevancy of each block. The page relevancy is then calculated as the sum of all block relevancy scores in the page. The algorithm also calculates a URL score and identifies whether the URL is relevant to the topic. Dividing pages on the basis of headings yields appropriate blocks, because a complete block comprises the heading, content, images, links, tables, and sub-tables of that block only.
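The pipeline described in the abstract (heading-based segmentation, per-block relevancy, page relevancy as the sum of block scores, and a URL score) can be illustrated with a minimal Python sketch. The abstract does not give the scoring formulas, so everything specific here is an assumption made for illustration: block relevancy is taken as the fraction of block tokens that match the topic terms, the URL score as the fraction of topic terms that appear in the URL, and the names `segment`, `block_relevancy`, `page_relevancy`, and `url_score` are hypothetical rather than taken from the paper.

```python
import re
from html.parser import HTMLParser

HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


class BlockSegmenter(HTMLParser):
    """Collect page text into blocks, opening a new block at each heading tag."""

    def __init__(self):
        super().__init__()
        # One list of text fragments per block; the first block holds any
        # text that appears before the first heading.
        self.blocks = [[]]

    def handle_starttag(self, tag, attrs):
        if tag in HEADINGS:
            self.blocks.append([])

    def handle_data(self, data):
        self.blocks[-1].append(data)


def segment(html):
    """Return the non-empty content blocks of a page as plain-text strings."""
    parser = BlockSegmenter()
    parser.feed(html)
    parser.close()
    texts = [" ".join(parts).strip() for parts in parser.blocks]
    return [t for t in texts if t]


def block_relevancy(block_text, topic_terms):
    """Assumed score: share of the block's tokens that are topic terms."""
    tokens = tokenize(block_text)
    if not tokens:
        return 0.0
    hits = sum(1 for t in tokens if t in topic_terms)
    return hits / len(tokens)


def page_relevancy(html, topic_terms):
    """Page relevancy as the sum of all block relevancy scores."""
    return sum(block_relevancy(b, topic_terms) for b in segment(html))


def url_score(url, topic_terms):
    """Assumed score: share of topic terms that occur as tokens in the URL."""
    if not topic_terms:
        return 0.0
    return len(set(tokenize(url)) & topic_terms) / len(topic_terms)
```

In a crawler built along these lines, a page would be kept and its outgoing links followed only when `page_relevancy` and `url_score` exceed thresholds chosen for the topic; the thresholds, like the scoring functions, would need tuning and are not specified in the abstract.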
Keywords :
Web sites; query processing; search engines; HTML; URL score; WWW; Web pages; Web search engines; World Wide Web; content block; content block segmentation; database; focused Web crawling; focused crawler; Crawlers; Databases; Ontologies; Information Retrieval; Web crawling algorithms; focused crawling algorithm; page rank; search engine
Conference_Title :
Electronic Systems, Signal Processing and Computing Technologies (ICESC), 2014 International Conference on
Conference_Location :
Nagpur
DOI :
10.1109/ICESC.2014.69