DocumentCode
485445
Title
An improved classification method for the common OLE file by N-gram analysis and vector space model
Author
Hong-Rong Yang ; Ming Xu ; Ning Zheng
Author_Institution
Inst. of Comput. Applic. Technol., Hangzhou Dianzi Univ., Hangzhou
fYear
2007
fDate
12-14 Dec. 2007
Firstpage
983
Lastpage
986
Abstract
Identifying a file's type by its extension is unreliable. The alternative, the magic-bytes method, may fail to distinguish between file types whose header information is similar, such as the commonly used MS Office OLE files. This paper proposes an efficient classification method for common OLE files. To overcome the shortcoming of the original N-gram analysis technique, which cannot easily tell ambiguous file types apart, N-gram analysis is combined with the vector space model to identify common OLE files. Characteristic items are extracted from the most frequent byte values of each file class, and the cosine value between two vectors is then used to classify ambiguous file types. Experimental results demonstrate that the mechanism is effective in identifying Office OLE files and performs better than the common N-gram method.
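The abstract gives only an outline of the method, so the following Python is a minimal sketch under stated assumptions rather than the authors' implementation. It assumes 1-grams (single byte values), takes the "characteristic items" to be the top_k byte values with the highest mean frequency in each class, and represents each class by the mean of its sample vectors. The function names, the top_k parameter, and the centroid step are hypothetical choices made for illustration.

```python
import math
from collections import Counter


def byte_frequencies(data: bytes) -> list[float]:
    """Normalised 1-gram (byte value) frequencies as a 256-element list."""
    counts = Counter(data)
    total = len(data) or 1  # avoid division by zero on empty input
    return [counts.get(b, 0) / total for b in range(256)]


def characteristic_items(samples: list[bytes], top_k: int) -> list[int]:
    """Byte values with the highest mean frequency over one class (assumed rule)."""
    mean = [0.0] * 256
    for sample in samples:
        for b, f in enumerate(byte_frequencies(sample)):
            mean[b] += f / len(samples)
    return sorted(range(256), key=lambda b: mean[b], reverse=True)[:top_k]


def project(data: bytes, items: list[int]) -> list[float]:
    """Vector of a file's frequencies restricted to the characteristic items."""
    freq = byte_frequencies(data)
    return [freq[b] for b in items]


def cosine(u: list[float], v: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def train(classes: dict[str, list[bytes]], top_k: int = 16):
    """Build the shared item list and one mean vector (centroid) per class."""
    items = sorted({b for samples in classes.values()
                    for b in characteristic_items(samples, top_k)})
    centroids = {}
    for label, samples in classes.items():
        vectors = [project(s, items) for s in samples]
        centroids[label] = [sum(col) / len(vectors) for col in zip(*vectors)]
    return items, centroids


def classify(data: bytes, items: list[int], centroids: dict[str, list[float]]) -> str:
    """Return the class whose centroid has the largest cosine with the file's vector."""
    vector = project(data, items)
    return max(centroids, key=lambda label: cosine(vector, centroids[label]))
```

In practice the training samples would be the raw bytes of labelled Word, Excel, and PowerPoint files, and higher-order N-grams could replace the single-byte frequencies; neither detail is specified in the abstract.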
Keywords
file organisation; pattern classification; vectors; MS Office OLE file; N-gram analysis; ambiguous file types cataloguing; classification method; cosine value; file extension; file type identification; magic bytes method; vector space model; N-gram; OLE file
fLanguage
English
Publisher
IET
Conference_Titel
Wireless, Mobile and Sensor Networks, 2007. (CCWMSN07). IET Conference on
Conference_Location
Shanghai
ISSN
0537-9989
Print_ISBN
978-0-86341-836-5
Type
conf
Filename
4786369