Abstract :
The internet has become a useful source of information in day-to-day life. As a result, the World Wide Web has grown enormously, both in the volume of its traffic and in the size and complexity of its websites. Web Usage Mining (WUM) is one of the main applications of data mining, artificial intelligence and related techniques to web data; it forecasts users' visiting behavior and infers their interests by investigating samples of that data. WUM is directly involved in a wide range of applications, such as e-commerce, e-learning, Web analytics and information retrieval. Web log data is one of the major sources of information about the links users visit, their browsing patterns and the time they spend on a particular page or link. This information can be used in several applications, such as adaptive web sites, customized services, customer summaries, pre-fetching and the design of attractive web sites. Existing web usage mining approaches have several problems; in particular, existing algorithms are difficult to apply in practice. New research is therefore needed to predict the future behavior of web users accurately and with short execution time. WUM consists of preprocessing, pattern discovery and pattern analysis. Log data is characteristically noisy and ambiguous, so preprocessing is essential for an effective mining process. In this paper, a novel preprocessing technique is proposed that removes local noise, global noise and web robots. The Anonymous Microsoft Web Dataset and the MSNBC.com Anonymous Web Dataset are used to evaluate the proposed preprocessing technique.
Keywords :
Internet; Web sites; data analysis; data mining; information retrieval; pattern recognition; Anonymous Microsoft Web Dataset; MSNBC.com Anonymous Web Dataset; WUM; Web analytics; Web data; Web log mining; Web robot removal; Web usage mining algorithm; World Wide Web; adaptive Web sites; artificial intelligence; browsing pattern; customer summary; e-commerce; e-learning; enhanced preprocessing technique; global noise removal; information source; local noise removal; log data preprocessing; mining process; pattern analysis; pattern discovery; prefetching; user interest; user visiting behavior forecasting; visited links; Content Path Set; Data Cleaning; Path Completion; Preprocessing; Travel Path Set