• DocumentCode
    485445
  • Title

    An improved classification method for the common OLE file by N-gram analysis and vector space model

  • Author

    Hong-Rong Yang ; Ming Xu ; Ning Zheng

  • Author_Institution
    Inst. of Comput. Applic. Technol., Hangzhou Dianzi Univ., Hangzhou
  • fYear
    2007
  • fDate
    12-14 Dec. 2007
  • Firstpage
    983
  • Lastpage
    986
  • Abstract
    Identifying a file's type by its extension is unreliable. The magic-bytes method may also fail to distinguish one type from another when files have similar header information, as the commonly used MS Office OLE files do. This paper proposes an efficient classification method for the common OLE files. To overcome the shortcoming of the original N-gram analysis technique, which cannot easily tell ambiguous file types apart, N-gram analysis is combined with the vector space model to identify the common OLE files. The characteristic items are extracted from the most frequent byte values of each file class, and the cosine value of two vectors is then used to classify ambiguous file types. The experimental results demonstrate that the mechanism is effective in identifying Office OLE files and performs better than the common N-gram method.
  • Keywords
    file organisation; pattern classification; vectors; MS Office OLE file; N-gram analysis; ambiguous file types cataloguing; classification method; cosine value; file extension; file type identification; magic bytes method; vector space model; N-gram; OLE file
  • fLanguage
    English
  • Publisher
    iet
  • Conference_Titel
    Wireless, Mobile and Sensor Networks, 2007. (CCWMSN07). IET Conference on
  • Conference_Location
    Shanghai
  • ISSN
    0537-9989
  • Print_ISBN
    978-0-86341-836-5
  • Type

    conf

  • Filename
    4786369
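
The approach summarized in the abstract (characteristic items taken from the most frequent byte values of each class, then cosine similarity between vectors) can be illustrated with a minimal sketch. This is not the authors' implementation: the choice of 1-grams (single bytes), the `top_k` cutoff, the averaging of per-class frequencies, and the function names `byte_frequencies`, `build_model`, `cosine`, and `classify` are all assumptions made for illustration.

```python
import math
from collections import Counter


def byte_frequencies(data: bytes) -> list[float]:
    """Normalized 1-gram (single-byte) frequency distribution of a byte string."""
    counts = Counter(data)
    total = len(data) or 1  # avoid division by zero on empty input
    return [counts.get(b, 0) / total for b in range(256)]


def cosine(u: list[float], v: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def build_model(training: dict[str, list[bytes]], top_k: int = 16):
    """Assumed training step: average byte frequencies per class, take each
    class's top_k most frequent byte values as characteristic items, and use
    the union of those items as the dimensions of a shared vector space."""
    means = {}
    for label, samples in training.items():
        mean = [0.0] * 256
        for sample in samples:
            for b, f in enumerate(byte_frequencies(sample)):
                mean[b] += f / len(samples)
        means[label] = mean

    items = set()
    for mean in means.values():
        ranked = sorted(range(256), key=lambda b: mean[b], reverse=True)
        items.update(ranked[:top_k])
    features = sorted(items)

    class_vectors = {
        label: [mean[b] for b in features] for label, mean in means.items()
    }
    return features, class_vectors


def classify(data: bytes, features, class_vectors):
    """Return the (label, score) whose class vector has the highest cosine
    similarity with the file's vector over the characteristic items."""
    freqs = byte_frequencies(data)
    vector = [freqs[b] for b in features]
    scores = {label: cosine(vector, cv) for label, cv in class_vectors.items()}
    best = max(scores, key=scores.get)
    return best, scores[best]


if __name__ == "__main__":
    # Toy byte strings standing in for real OLE files (hypothetical data).
    training = {
        "doc": [b"aaaab", b"aaab"],
        "xls": [b"bbbba", b"bbba"],
    }
    features, class_vectors = build_model(training, top_k=2)
    print(classify(b"aaaaab", features, class_vectors))
```

In practice the training samples would be the raw bytes of labelled Word, Excel, and PowerPoint files; the toy strings above only show the mechanics.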