• DocumentCode
    1388401
  • Title
    Practical Efficient String Mining
  • Author
    Dhaliwal, Jasbir; Puglisi, Simon J.; Turpin, Andrew
  • Author_Institution
    Sch. of Comput. Sci. & Inf. Technol., RMIT Univ., Melbourne, VIC, Australia
  • Volume
    24
  • Issue
    4
  • fYear
    2012
  • fDate
    4/1/2012
  • Firstpage
    735
  • Lastpage
    744
  • Abstract
    In recent years, several algorithms for mining frequent and emerging substring patterns from databases of string data (such as proteins and natural language texts) have been discovered, all of which traverse an enhanced suffix array data structure. All of these algorithms lie at either extreme of the efficiency spectrum; they are either fast and use enormous amounts of space, or they are compact and orders of magnitude slower. In this paper, we present an algorithm that achieves the best of both these extremes, having runtime comparable to the fastest published algorithms while using less space than the most space efficient ones. This excellent practical performance is underpinned by theoretical guarantees. Our main mechanism for keeping memory usage low is to build the enhanced suffix array incrementally, in blocks. Once built, a block is traversed to output patterns with required support before its space is reclaimed to be used for the next block.
  • Keywords
    data mining; data structures; database management systems; storage management; string matching; efficiency spectrum; emerging substring patterns; enhanced suffix array data structure; memory usage; mining frequent patterns; practical efficient string mining; published algorithms; string data; theoretical guarantees; Arrays; Data mining; Databases; Proteins; Runtime; Sorting; String mining; algorithms; data mining; suffix array; suffix tree
  • fLanguage
    English
  • Journal_Title
    Knowledge and Data Engineering, IEEE Transactions on
  • Publisher
    IEEE
  • ISSN
    1041-4347
  • Type
    jour
  • DOI
    10.1109/TKDE.2010.242
  • Filename
    5645631
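
The abstract describes a build, traverse, reclaim loop: one block of the index is constructed, patterns with the required support are reported from it, and its memory is released before the next block is built. The sketch below illustrates only that loop and is not the authors' algorithm. It assumes that blocks are formed by grouping suffixes on their first character, that "support" means the number of database strings containing a pattern, and it uses a plain dictionary in place of an enhanced suffix array. The function name `mine_frequent_substrings` and its parameters are hypothetical.

```python
def mine_frequent_substrings(strings, min_support):
    """Return {pattern: support} for substrings found in at least
    min_support of the input strings.

    Illustrative only: blocks are chosen by first character (an assumption),
    and a dictionary stands in for the enhanced suffix array.
    """
    results = {}
    alphabet = sorted(set("".join(strings)))

    for first_char in alphabet:
        # Build one block: every suffix that starts with first_char,
        # stored as (string id, start offset).
        block = [
            (doc, i)
            for doc, s in enumerate(strings)
            for i, ch in enumerate(s)
            if ch == first_char
        ]

        # Traverse the block: record which strings contain each prefix
        # of each suffix in the block.
        containing = {}
        for doc, i in block:
            s = strings[doc]
            for end in range(i + 1, len(s) + 1):
                containing.setdefault(s[i:end], set()).add(doc)

        # Output the patterns that meet the support threshold.
        for pattern, docs in containing.items():
            if len(docs) >= min_support:
                results[pattern] = len(docs)

        # Release the block's memory before building the next block.
        del block, containing

    return results


if __name__ == "__main__":
    database = ["abab", "bab", "cab"]
    print(mine_frequent_substrings(database, min_support=2))
```

Because every pattern beginning with a given character is counted entirely within that character's block, each block can be discarded as soon as its patterns are reported. This naive version takes quadratic time and space per block, so it shows the memory-reuse idea and not the performance the abstract claims.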