  • DocumentCode
    3433602
  • Title
    Phrase-based text representation for managing the Web documents
  • Author
    Sharma, Rupali ; Raman, S.
  • Author_Institution
    Dept. of Comput. Sci. & Eng., Indian Inst. of Technol. Madras, Chennai, India
  • fYear
    2003
  • fDate
    28-30 April 2003
  • Firstpage
    165
  • Lastpage
    169
  • Abstract
    The World Wide Web has brought information to the fingertips of its users. Because most documents available on the Web are machine-readable but not machine-understandable, ensuring the retrieval of relevant information remains a difficult task. In the traditional text representation approach, high-frequency keywords are used as the representative terms of a text. The main drawbacks of this approach are the lack of a direct relationship between a word's frequency and its importance, and the effect of word ambiguity. To address these shortcomings of the keyword-based method, this paper presents a phrase-based text representation approach that uses rule-based natural language processing (NLP) techniques. Key phrases are extracted from text documents through a process of partial parsing. By reducing the ambiguity of words considered in isolation, the approach makes the indexing terms more meaningful and thereby seeks to improve retrieval effectiveness.
  • Keywords
    grammars; indexing; information retrieval; natural languages; text analysis; Web document management; World Wide Web; indexing terms; key phrase extraction; keyword-based method; partial parsing; phrase-based text representation; rule-based natural language processing techniques; text documents; word ambiguities; word frequency; Frequency; Indexing; Information retrieval; Joining processes; Libraries; Resource description framework; Semantic Web; Technology management; Web sites; XML;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Information Technology: Coding and Computing [Computers and Communications], 2003. Proceedings. ITCC 2003. International Conference on
  • Print_ISBN
    0-7695-1916-4
  • Type
    conf
  • DOI
    10.1109/ITCC.2003.1197520
  • Filename
    1197520