• DocumentCode
    2980102
  • Title
    Improving Data Processing Time with Access Sequence Prediction
  • Author
    Boonserm, P. ; Bingqiang Wang ; See, Solomon ; Achalakul, Tiranee
  • Author_Institution
    Dept. of Comput. Eng., King Mongkut's Univ. of Technol. Thonburi, Bangkok, Thailand
  • fYear
    2012
  • fDate
    17-19 Dec. 2012
  • Firstpage
    770
  • Lastpage
    775
  • Abstract
    Genomic research today often faces the problem of big data. The volume of data produced by the genome sequencing process can grow quickly and continuously, creating problems for both storage and processing. BGI, one of the renowned genomic research institutes in China, faces the same problem. Research at BGI depends on several sequencing machines, and a single machine pipeline may generate around 1.4 terabytes of temporary data. In addition, multiple read and write operations occur continuously during processing. The resulting I/O bottleneck severely degrades research throughput. A high-performance computing system alone is not sufficient for processing experimental results efficiently. To hide the I/O latency, an effective big data management framework is needed at BGI. In this paper, we propose a hybrid prediction model for data access patterns. The goal is to predict the next pieces of data needed by the processor and preload them into memory in order to improve the overall processing time. Results from initial experiments show that the proposed model can deliver high prediction accuracy in linear time. Moreover, the error rate is low at 1.85%, which is better than that of commonly used methods such as Prediction Graph, ANN, and ARMA. We believe that, with some further fine-tuning, the model can be used as part of the big data management framework deployed at BGI in the near future.
  • Keywords
    genomics; information retrieval; medical information systems; pipeline processing; BGI; I/O bottleneck; I/O latency; data access pattern; data access sequence prediction; data management framework; data processing time improvement; data storage; genome sequencing process; hybrid prediction model; read and write operation; sequencing machine; Autoregressive processes; Complexity theory; Computational modeling; Data models; Mathematical model; Prediction algorithms; Predictive models; Big Data; Hybrid ARMA Model; I/O Bottleneck; Paired t-Test
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Parallel and Distributed Systems (ICPADS), 2012 IEEE 18th International Conference on
  • Conference_Location
    Singapore
  • ISSN
    1521-9097
  • Print_ISBN
    978-1-4673-4565-1
  • Electronic_ISBN
    1521-9097
  • Type
    conf
  • DOI
    10.1109/ICPADS.2012.125
  • Filename
    6413607