  • DocumentCode
    517699
  • Title
    An Improvement to TF: Term Distribution Based Term Weight Algorithm
  • Author
    Xia Tian ; Tong, Wang
  • Author_Institution
    Dept. of Comput. & Inf. Sci., Shanghai Second Polytech. Univ., Shanghai, China
  • Volume
    1
  • fYear
    2010
  • fDate
    24-25 April 2010
  • Firstpage
    252
  • Lastpage
    255
  • Abstract
    In the process of document formalization, the term weight algorithm plays an important role: it strongly affects the precision and recall of natural language processing (NLP) systems. Currently, the TF-IDF term weight algorithm is widely applied in language models to build NLP systems. Since term frequency is not the only discriminator that must be considered when calculating a term weight that reflects term importance, we investigated other statistical characteristics of terms and found an important discriminator: term distribution. Furthermore, we found that a term with higher frequency and a distribution close to hypo-dispersion should be given a higher weight than one with lower frequency and a distribution close to intensive. Based on this hypothesis, and leveraging the Pearson chi-square test statistic, this paper puts forward a Term Distribution based Term Weight Algorithm. The experimental results at the end of the paper confirm the reliability and efficiency of the algorithm.
  • Keywords
    natural language processing; Pearson chi-square test statistic; term weight algorithm; Computer networks; Computer security; Distributed computing; Frequency; Information retrieval; Information science; Information security; Space technology; Statistical distributions; Wireless communication; IDF; Natural Language Processing; TF; Term Weight
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Networks Security Wireless Communications and Trusted Computing (NSWCTC), 2010 Second International Conference on
  • Conference_Location
    Wuhan, Hubei
  • Print_ISBN
    978-0-7695-4011-5
  • Electronic_ISBN
    978-1-4244-6598-9
  • Type
    conf
  • DOI
    10.1109/NSWCTC.2010.66
  • Filename
    5480657
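
The abstract names the Pearson chi-square test statistic as the tool for measuring term distribution, but this record does not give the paper's weighting formula. As a rough illustration only, the sketch below computes the standard Pearson statistic for a term's occurrence counts over equal-sized segments of a text, against a uniform expectation. The function name `chi_square_dispersion`, the segmentation into equal parts, and the reading of "hypo-dispersion" as "evenly spread" are assumptions, not details taken from the paper.

```python
def chi_square_dispersion(segment_counts):
    """Pearson chi-square statistic of per-segment term counts.

    Compares the observed counts with a uniform expectation
    (total / number of segments). A value of 0.0 means the term is
    spread perfectly evenly; larger values mean it is more clustered.
    Hypothetical helper: the paper's actual formula is not in this record.
    """
    total = sum(segment_counts)
    k = len(segment_counts)
    if total == 0 or k == 0:
        return 0.0  # term absent or no segments: nothing to measure
    expected = total / k
    return sum((obs - expected) ** 2 / expected for obs in segment_counts)


# Two terms with the same frequency (12) but different distributions.
evenly_spread = [3, 3, 3, 3]   # appears throughout the document
clustered = [12, 0, 0, 0]      # appears in one segment only

print(chi_square_dispersion(evenly_spread))  # 0.0
print(chi_square_dispersion(clustered))      # 36.0
```

Under the abstract's hypothesis, the evenly spread term would receive the higher weight of the two; how the statistic is combined with term frequency to produce that weight is not stated here.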