• DocumentCode
    3751969
  • Title
    Spark-gram: Mining frequent N-grams using parallel processing in Spark
  • Author
    Prasetya Ajie Utama; Bayu Distiawan
  • Author_Institution
    Faculty of Computer Science, Universitas Indonesia, Jakarta, Indonesia
  • fYear
    2015
  • Firstpage
    129
  • Lastpage
    136
  • Abstract
    Mining sequence patterns in the form of n-grams (sequences of words that appear consecutively) from large text data is a fundamental part of several information retrieval and natural language processing applications. In this work, we present Spark-gram, a method for large-scale frequent sequence mining based on Spark, adapted from its equivalent MapReduce method, Suffix-σ. The Spark-gram design allows the discovery of all n-grams with maximum length σ and minimum occurrence frequency τ, using an iterative algorithm with only a single shuffle phase. We show that Spark-gram can outperform Suffix-σ, mainly when τ is high, but may perform worse as the value of σ grows.
  • Keywords
    "Sparks","Computational modeling","Data mining","Data models","Data processing","Adaptation models","Iterative methods"
  • Publisher
    IEEE
  • Conference_Titel
    Advanced Computer Science and Information Systems (ICACSIS), 2015 International Conference on
  • Type
    conf
  • DOI
    10.1109/ICACSIS.2015.7415169
  • Filename
    7415169
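
The mining task stated in the abstract (find every n-gram of length at most σ that occurs at least τ times) can be illustrated with a minimal single-machine sketch. This is a naive counting baseline, not the Spark-gram or Suffix-σ algorithm, and it does not use Spark. The function name `frequent_ngrams`, whitespace tokenization, and counting total occurrences across all documents are assumptions made for illustration only.

```python
from collections import Counter


def frequent_ngrams(documents, sigma, tau):
    """Return {n-gram tuple: count} for n-grams with length <= sigma and count >= tau.

    Hypothetical helper: naive enumeration, assuming whitespace-separated
    tokens and frequency measured as total occurrences in the collection.
    """
    counts = Counter()
    for doc in documents:
        tokens = doc.split()
        for start in range(len(tokens)):
            # Emit every n-gram that begins at this position, up to length sigma.
            max_len = min(sigma, len(tokens) - start)
            for n in range(1, max_len + 1):
                counts[tuple(tokens[start:start + n])] += 1
    # Keep only n-grams that meet the minimum occurrence frequency tau.
    return {gram: c for gram, c in counts.items() if c >= tau}


docs = ["a b a b", "a b c"]
result = frequent_ngrams(docs, sigma=2, tau=2)
print(result)
```

With σ = 2 and τ = 2 on these two toy documents, only the n-grams `a`, `b`, and `a b` survive the frequency filter. A distributed method has to produce the same result without materializing every candidate n-gram, which is the cost this naive version ignores.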