Title :
Performance growth for text(template extraction)
Author :
Sundar, G. Naveen ; Narmadha, D. ; Haran, A.P.
Author_Institution :
Sch. of Comput. Sci. & Technol., Karunya Univ., Coimbatore, India
Abstract :
The World Wide Web gives every individual access to a vast amount of information, but discovering the significant pieces of that information becomes progressively more difficult. Web mining tackles this problem by applying data mining techniques to Web data and documents in order to extract knowledge from web sources. The data available on the web is so heterogeneous and so large that extracting the portion pertinent to a particular problem becomes a crucial task. This paper focuses on an algorithm for detecting and extracting templates from heterogeneous web pages. Locality sensitive hashing is used to find the similarity between the input web documents, and it compares favorably with the Minimum Description Length (MDL) principle and the hash cluster process in terms of execution time.
Keywords :
Web sites; data mining; document handling; information retrieval; MDL principle; Web data; Web documents; Web mining; Web pages; Web sources; World Wide Web; accessible data extraction; data mining techniques; hash cluster process; knowledge extraction; locality sensitive hashing; minimum description length; performance growth; template extraction; templates detection; templates extraction; text; Clustering algorithms; HTML; Logic gates; Web mining; Cluster; Minimum Description Length (MDL); Non-Content Path; Template Detection;
Conference_Title :
Electronics and Communication Systems (ICECS), 2014 International Conference on
Conference_Location :
Coimbatore
Print_ISBN :
978-1-4799-2321-2
DOI :
10.1109/ECS.2014.6892798