DocumentCode :
3751969
Title :
Spark-gram: Mining frequent N-grams using parallel processing in Spark
Author :
Prasetya Ajie Utama;Bayu Distiawan
Author_Institution :
Faculty of Computer Science, Universitas Indonesia, Jakarta, Indonesia
fYear :
2015
Firstpage :
129
Lastpage :
136
Abstract :
Mining sequence patterns in the form of n-grams (sequences of words that appear consecutively) from large text data is a fundamental part of several information retrieval and natural language processing applications. In this work, we present Spark-gram, a method for large-scale frequent sequence mining based on Spark, adapted from an equivalent MapReduce method called Suffix-σ. The Spark-gram design allows the discovery of all n-grams with maximum length σ and minimum occurrence frequency τ, using an iterative algorithm with only a single shuffle phase. We show that Spark-gram can outperform Suffix-σ, mainly when τ is high, but that it may perform worse as the value of σ grows.
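The mining task described in the abstract can be illustrated with a minimal single-machine sketch in plain Python. This is not the Spark-gram or Suffix-σ algorithm and involves no Spark or MapReduce; it only shows what "all n-grams with maximum length σ and minimum occurrence frequency τ" means. The function name `frequent_ngrams`, whitespace tokenization, and the toy documents are assumptions made for illustration.

```python
from collections import Counter


def frequent_ngrams(documents, sigma, tau):
    """Return every n-gram of length <= sigma that occurs at least tau times.

    Hypothetical helper for illustration only: counts naively in memory,
    whereas the paper distributes the work across a cluster.
    """
    counts = Counter()
    for doc in documents:
        tokens = doc.split()  # assumption: whitespace tokenization
        for start in range(len(tokens)):
            # Count each n-gram (n = 1..sigma) that begins at this position.
            for n in range(1, sigma + 1):
                if start + n > len(tokens):
                    break
                counts[tuple(tokens[start:start + n])] += 1
    # Keep only n-grams that meet the minimum frequency tau.
    return {gram: c for gram, c in counts.items() if c >= tau}


docs = ["a b a b", "b a b"]
result = frequent_ngrams(docs, sigma=2, tau=3)
```

With these toy inputs, the bigram `("b", "a")` occurs only twice and is filtered out by τ = 3, while `("a",)`, `("b",)`, and `("a", "b")` are retained. Raising τ shrinks the output, and raising σ enlarges the set of candidate n-grams that must be counted.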
Keywords :
"Sparks","Computational modeling","Data mining","Data models","Data processing","Adaptation models","Iterative methods"
Publisher :
ieee
Conference_Title :
Advanced Computer Science and Information Systems (ICACSIS), 2015 International Conference on
Type :
conf
DOI :
10.1109/ICACSIS.2015.7415169
Filename :
7415169