• DocumentCode
    2877558
  • Title

    Identifying Multi-Word Terms by Text-Segments

  • Author

    Chen, Jisong ; Yeh, Chung-Hsing ; Chau, Rowena

  • Author_Institution
    Monash University, Australia
  • fYear
    2006
  • fDate
    2006-06-01
  • Firstpage
    19
  • Lastpage
    19
  • Abstract
    Traditional statistical approaches for identifying multi-word terms have to handle a large amount of noisy data and are extremely time-consuming. This paper presents a new statistical approach for identifying multi-word terms based on the co-related text-segments that exist in a group of documents. The approach involves three stages: (a) using a short predefined stoplist as the initial input to segment a set of text documents into text-segments, (b) calculating the segment-weights of all text-segments, and (c) applying the shorter text-segments to segment the longer text-segments based on the weight values. The newly generated text-segments then segment each other again until no text-segment can be divided further. The resulting text-segments are identified as terms based on a specified threshold. Initial experimental results on a set of traditional Chinese documents show that this approach achieves a recall of at least 76.39% and a precision of at least 91.05% when retrieving terms that occur multiple times, including 18.30% newly identified terms. The approach can be applied to identify multi-word terms in any language.
  • Keywords
    Australia; Data mining; Frequency; Government; Information management; Information technology; Natural languages; Speech; Statistical analysis; Statistics;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Web-Age Information Management Workshops, 2006. WAIM '06. Seventh International Conference on
  • Conference_Location
    Hong Kong, China
  • Print_ISBN
    0-7695-2705-1
  • Type

    conf

  • DOI
    10.1109/WAIMW.2006.16
  • Filename
    4027179
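
The three stages named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation, and the abstract does not define the segment-weight or the splitting rule, so several details here are assumptions: tokens are whitespace-separated words (the paper itself works on Chinese text), a segment's weight is simply its frequency, a shorter segment splits a longer one only when its weight is strictly higher, and the names `initial_segments`, `split_by`, `identify_terms`, and `min_weight` are hypothetical.

```python
from collections import Counter


def initial_segments(docs, stoplist):
    """Stage (a): cut each document at stoplist words into token tuples."""
    segments = []
    for doc in docs:
        current = []
        for token in doc.lower().split():  # assumption: whitespace tokens
            if token in stoplist:
                if current:
                    segments.append(tuple(current))
                current = []
            else:
                current.append(token)
        if current:
            segments.append(tuple(current))
    return segments


def split_by(segment, shorter):
    """Split `segment` around the first occurrence of `shorter`.

    Returns the non-empty pieces, or None if `shorter` does not occur.
    """
    n, m = len(segment), len(shorter)
    for i in range(n - m + 1):
        if segment[i:i + m] == shorter:
            pieces = [segment[:i], shorter, segment[i + m:]]
            return [p for p in pieces if p]
    return None


def identify_terms(docs, stoplist, min_weight=2):
    # Stage (b): weight = raw frequency of the segment (an assumption).
    weights = Counter(initial_segments(docs, stoplist))

    # Stage (c): let shorter segments split longer ones until nothing changes.
    changed = True
    while changed:
        changed = False
        for longer in sorted(weights, key=len, reverse=True):
            for shorter in sorted(weights, key=len):
                if len(shorter) >= len(longer):
                    break  # remaining candidates are not shorter
                if weights[shorter] <= weights[longer]:
                    continue  # assumed rule: only heavier segments may split
                pieces = split_by(longer, shorter)
                if pieces:
                    count = weights.pop(longer)
                    for piece in pieces:
                        weights[piece] += count
                    changed = True
                    break
            if changed:
                break  # restart with the updated segment set

    # Keep segments whose weight reaches the (hypothetical) threshold.
    return {" ".join(seg): w for seg, w in weights.items() if w >= min_weight}


docs = [
    "the text segment of the document",
    "a text segment is a unit",
    "the long text segment of a page",
]
stoplist = {"the", "of", "a", "is"}
terms = identify_terms(docs, stoplist)
```

With these three toy sentences, the segment "long text segment" is split by the more frequent "text segment", which is then the only segment that reaches the threshold. Each split replaces a segment with strictly shorter pieces, so the loop always terminates.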