The Similarity Computing of Documents Based on VSM

Author

Guo, Qinglin

Author_Institution

Sch. of Comput. Sci. & Technol., North China Electr. Power Univ., Beijing

fYear

2008

fDate

July 28 2008-Aug. 1 2008

Firstpage

585

Lastpage

586

Abstract

Precise and efficient computation of document similarity is the foundation of other document-processing tasks. In this paper, the DF and TF-IDF algorithms are improved. First, DF has linear time complexity, which suits large-scale document processing, but it has the drawback that exceptionally useful features may be deleted; we compensate for this by adding the counts of words that occur at important positions. Second, we adjust the weight of each feature using the result of the feature selection phase. In this way, we improve the precision of document similarity without adding much time or space complexity.
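
The abstract gives no formulas, so the following is only a minimal Python sketch of the general pipeline it names: DF-based feature selection, TF-IDF weighting, and cosine similarity in the vector space model. The function names, the title bonus (`title_bonus`), the threshold (`min_score`), and the toy documents are all assumptions made for illustration; in particular, treating titles as the "important places" is one possible reading, not the paper's stated method, and the paper's adjustment of feature weights is not reproduced here.

```python
import math
from collections import Counter

def doc_freq(token_lists):
    """For each term, count the token lists (documents) that contain it."""
    df = Counter()
    for tokens in token_lists:
        df.update(set(tokens))
    return df

def select_features(bodies, titles, min_score=2, title_bonus=1):
    """Keep terms whose body DF plus a title bonus reaches min_score.

    The additive bonus is an assumed form; the abstract only says that
    counts of words at important places are added.
    """
    body_df = doc_freq(bodies)
    title_df = doc_freq(titles)
    terms = set(body_df) | set(title_df)
    return {t for t in terms
            if body_df[t] + title_bonus * title_df[t] >= min_score}

def tfidf(tokens, features, df, n_docs):
    """Sparse TF-IDF vector (dict) restricted to the selected features."""
    tf = Counter(t for t in tokens if t in features and df[t] > 0)
    return {t: c * math.log(n_docs / df[t]) for t, c in tf.items()}

def cosine(u, v):
    """Cosine similarity of two sparse vectors; 0.0 if either is empty."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

# Toy corpus (hypothetical data): token lists for bodies and titles.
bodies = [
    ["vector", "space", "model", "similarity"],
    ["vector", "space", "similarity", "document"],
    ["power", "grid", "load"],
]
titles = [["similarity"], ["document"], ["grid"]]

features = select_features(bodies, titles)
df = doc_freq(bodies)
vecs = [tfidf(b, features, df, len(bodies)) for b in bodies]
sim_01 = cosine(vecs[0], vecs[1])  # documents sharing several terms
sim_02 = cosine(vecs[0], vecs[2])  # documents sharing no terms
```

In this toy corpus, "document" occurs in only one body and would be dropped by a plain DF threshold of 2, but its title occurrence lifts it over the threshold, which illustrates the kind of rescue the abstract describes.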

Keywords

computational complexity; document handling; TF-IDF algorithms; VSM; documents similarity computing; feature selection phase; mass documents processing; space complexity; time complexity; Application software; Computer applications; Computer science; Data mining; Entropy; Frequency; Information retrieval; Internet; Mutual information; Organizing; TF-IDF; documents similarity; feature selection

fLanguage

English

Publisher

IEEE

Conference_Titel

Computer Software and Applications, 2008. COMPSAC '08. 32nd Annual IEEE International

Conference_Location

Turku

ISSN

0730-3157

Print_ISBN

978-0-7695-3262-2

Electronic_ISBN

0730-3157

Type

conf

DOI

10.1109/COMPSAC.2008.196

Filename

4591626