  • DocumentCode
    57027
  • Title
    Sceadan: Using Concatenated N-Gram Vectors for Improved File and Data Type Classification
  • Author
    Beebe, Nicole L. ; Maddox, Laurence A. ; Liu, Lishu ; Sun, Minghe
  • Author_Institution
    Inf. Syst. & Cyber Security Dept., Univ. of Texas at San Antonio, San Antonio, TX, USA
  • Volume
    8
  • Issue
    9
  • fYear
    2013
  • fDate
    Sept. 2013
  • Firstpage
    1519
  • Lastpage
    1530
  • Abstract
    Over 20 studies have been published in the past decade involving file and data type classification for digital forensics and information security applications. Methods using n-grams as inputs have proven the most successful across a wide variety of types; however, results are mixed regarding the utility of unigrams and bigrams as independent inputs. In this study, we use support vector machines (SVMs) with unigrams and bigrams, as well as complexity and other byte frequency-based measures, as inputs. Using concatenated unigrams and bigrams as input and a linear kernel SVM, we achieve a 73.4% classification rate across 38 file and data types, a significant improvement over previously reported results. We are the first to use concatenated n-grams as the sole input, and we show their superiority over inputs used previously. We also found that using too many different types of features as inputs results in overfitting and poor generalization. We include several types seldom or never studied in the past (Microsoft Office 2010 files, file system data, base64, base85, URL encoding, flash video, M4A, MP4, WMV, and JSON records). The “winning” approach is instantiated in an open source software tool called Sceadan (Systematic Classification Engine for Advanced Data ANalysis).
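As a rough illustration of the "concatenated unigrams and bigrams" input described in the abstract, the sketch below builds one feature vector from a block of bytes: 256 unigram frequencies followed by 65,536 bigram frequencies. It is not the Sceadan implementation. The function name `concat_ngram_vector`, the use of overlapping bigrams, and the normalization by count are all assumptions made for the example.

```python
def concat_ngram_vector(block: bytes) -> list[float]:
    """Return 256 unigram frequencies followed by 65536 bigram frequencies.

    Hypothetical helper: the normalization (relative frequency) and the
    use of overlapping byte pairs are assumptions, not taken from the paper.
    """
    unigram_counts = [0] * 256
    bigram_counts = [0] * (256 * 256)

    # Count single bytes.
    for b in block:
        unigram_counts[b] += 1

    # Count overlapping byte pairs; index = first_byte * 256 + second_byte.
    for first, second in zip(block, block[1:]):
        bigram_counts[first * 256 + second] += 1

    n_unigrams = len(block)
    n_bigrams = max(len(block) - 1, 0)

    # Convert counts to relative frequencies, guarding against empty input.
    unigrams = [c / n_unigrams if n_unigrams else 0.0 for c in unigram_counts]
    bigrams = [c / n_bigrams if n_bigrams else 0.0 for c in bigram_counts]

    # Concatenate into a single vector of length 256 + 65536 = 65792.
    return unigrams + bigrams
```

A vector of this shape could then be passed to any linear-kernel SVM trainer; the choice of library and training parameters is left open here.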
  • Keywords
    computational complexity; data analysis; digital forensics; file organisation; pattern classification; public domain software; software tools; support vector machines; Sceadan-systematic classification engine; advanced data analysis; byte frequency-based measure; complexity; concatenated N-gram vector; concatenated bigram; concatenated unigram; data type classification; file type classification; information security application; linear kernel SVM; open source software tool; support vector machine; winning approach; Classification algorithms; Complexity theory; Frequency measurement; Kernel; Support vector machine classification; Training; n-gram
  • fLanguage
    English
  • Journal_Title
    Information Forensics and Security, IEEE Transactions on
  • Publisher
    IEEE
  • ISSN
    1556-6013
  • Type
    jour
  • DOI
    10.1109/TIFS.2013.2274728
  • Filename
    6567922