• DocumentCode
    228810
  • Title

    Performance growth for text(template extraction)

  • Author

    Sundar, G. Naveen ; Narmadha, D. ; Haran, A.P.

  • Author_Institution
    Sch. of Comput. Sci. & Technol., Karunya Univ., Coimbatore, India
  • fYear
    2014
  • fDate
    13-14 Feb. 2014
  • Firstpage
    1
  • Lastpage
    6
  • Abstract
    The World Wide Web gives every individual access to plenty of information, but it becomes progressively more difficult to discover the significant pieces of information. Web mining tries to tackle this problem by applying data mining techniques to Web data and documents in order to extract knowledge from web sources. The data available on the web is so heterogeneous and huge that extracting the accessible data pertinent to a particular problem becomes a crucial task. This paper focuses on an algorithm for detecting and extracting templates from web pages that are heterogeneous in nature. Locality-sensitive hashing finds the similarity between the input web documents and performs well compared to the Minimum Description Length (MDL) principle and the hash cluster process in terms of execution time.
  • Keywords
    Web sites; data mining; document handling; information retrieval; MDL principle; Web data; Web documents; Web mining; Web pages; Web sources; World Wide Web; accessible data extraction; data mining techniques; hash cluster process; knowledge extraction; locality sensitive hashing; minimum description length; performance growth; template extraction; templates detection; templates extraction; text; Clustering algorithms; HTML; Logic gates; Web mining; Cluster; Minimum Description Length (MDL); Non-Content Path; Template Detection;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Electronics and Communication Systems (ICECS), 2014 International Conference on
  • Conference_Location
    Coimbatore
  • Print_ISBN
    978-1-4799-2321-2
  • Type

    conf

  • DOI
    10.1109/ECS.2014.6892798
  • Filename
    6892798