Spark-gram: Mining frequent N-grams using parallel processing in Spark

Author

Prasetya Ajie Utama;Bayu Distiawan

Author_Institution

Faculty of Computer Science, Universitas Indonesia, Jakarta, Indonesia

fYear

2015

Firstpage

129

Lastpage

136

Abstract

Mining sequence patterns in the form of n-grams (sequences of words that appear consecutively) from large text data is a fundamental part of several information retrieval and natural language processing applications. In this work, we present Spark-gram, a method for large-scale frequent sequence mining based on Spark, adapted from its MapReduce equivalent, Suffix-σ. The Spark-gram design allows the discovery of all n-grams with maximum length σ and minimum occurrence frequency τ, using an iterative algorithm with only a single shuffle phase. We show that Spark-gram can outperform Suffix-σ, mainly when τ is high, but may perform worse as σ grows.
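
To make the roles of σ and τ concrete, the following is a naive single-machine sketch of the problem the abstract describes. It is not the Spark-gram or Suffix-σ algorithm; the function name `frequent_ngrams`, whitespace tokenization, and the toy input are assumptions made purely for illustration.

```python
from collections import Counter


def frequent_ngrams(documents, sigma, tau):
    """Return every n-gram of length <= sigma occurring at least tau times.

    Hypothetical helper: a brute-force count, with no parallelism or shuffle.
    """
    counts = Counter()
    for doc in documents:
        tokens = doc.split()  # assumption: whitespace tokenization
        for start in range(len(tokens)):
            # Count every n-gram that begins at this position, up to length sigma.
            longest = min(sigma, len(tokens) - start)
            for n in range(1, longest + 1):
                counts[tuple(tokens[start:start + n])] += 1
    # Keep only the n-grams that meet the minimum frequency tau.
    return {gram: c for gram, c in counts.items() if c >= tau}


docs = ["a b a b", "b a b"]
result = frequent_ngrams(docs, sigma=2, tau=3)
print(result)
```

With σ = 2 and τ = 3, the bigram `("b", "a")` occurs only twice across the two toy documents and is filtered out, while `("a", "b")` occurs three times and is kept. Raising τ shrinks the output, and raising σ increases the number of candidate n-grams that must be counted.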

Keywords

"Sparks","Computational modeling","Data mining","Data models","Data processing","Adaptation models","Iterative methods"

Publisher

IEEE

Conference_Title

Advanced Computer Science and Information Systems (ICACSIS), 2015 International Conference on

Type

conf

DOI

10.1109/ICACSIS.2015.7415169

Filename

7415169

Link To Document

https://search.isc.ac/dl/search/defaultta.aspx?DTC=49&DC=3751969