• DocumentCode
    2189694
  • Title

    DT-Tree: A Semantic Representation of Scientific Papers

  • Author

    Rizvi, Syed Raza Ali ; Wang, Shawn Xiong

  • Author_Institution
    Dept. of Comput. Sci., California State Univ. Fullerton, Fullerton, CA, USA
  • fYear
    2010
  • fDate
    June 29 2010-July 1 2010
  • Firstpage
    1280
  • Lastpage
    1284
  • Abstract
    With the tremendous growth in electronic publication, locating the most relevant references is becoming a challenging task. Most effective document indexing structures represent a document as a vector of very high dimensionality. It is well known that such a representation suffers from the curse of dimensionality. In this paper, we introduce DT-Tree (DocumentTerm-Tree), a new structure for the representation of scientific documents. A DT-Tree represents a document using the 50 most frequent terms in that specific document. These terms are grouped into a tree structure according to where they appear in the document, such as the title, abstract, or section titles. The distance between two documents is calculated based on their DT-Trees. Two DT-Trees are compared using the Dice coefficient between the corresponding nodes of the trees. To verify the effectiveness of our similarity measure, we conducted experiments to cluster 150 documents in three categories, namely biology, chemistry and physics. The experimental results indicated 100% accuracy.
  • Keywords
    document handling; electronic publishing; indexing; trees (mathematics); DT tree; dice coefficient; document indexing; document term tree; electronic publication; scientific paper; semantic representation; Algorithm design and analysis; Biochemistry; Clustering algorithms; Pediatrics; Physics; Analysis; Document clustering; Similarity Measure; dimension reduction; key term extraction; kmeans; sparsity;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Computer and Information Technology (CIT), 2010 IEEE 10th International Conference on
  • Conference_Location
    Bradford
  • Print_ISBN
    978-1-4244-7547-6
  • Type

    conf

  • DOI
    10.1109/CIT.2010.231
  • Filename
    5577872
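
The comparison outlined in the abstract (top-50 terms grouped by where they appear, then a Dice coefficient between corresponding nodes) can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function names, the regex tokenizer, the absence of stop-word removal, and the use of a plain mean to combine per-node Dice scores are all assumptions made for illustration; the abstract does not specify any of them.

```python
import re
from collections import Counter


def dice(a: set, b: set) -> float:
    """Dice coefficient 2|A ∩ B| / (|A| + |B|) between two term sets."""
    if not a and not b:
        return 0.0  # assumption: two empty nodes contribute no similarity
    return 2 * len(a & b) / (len(a) + len(b))


def build_dt_tree(sections: dict, top_k: int = 50) -> dict:
    """Map each document part (e.g. 'title', 'abstract') to the subset of
    the document's top_k most frequent terms that occur in that part.

    Tokenization is a simple lowercase word split (an assumption)."""
    tokens = {name: re.findall(r"[a-z]+", text.lower())
              for name, text in sections.items()}
    counts = Counter(t for toks in tokens.values() for t in toks)
    top_terms = {term for term, _ in counts.most_common(top_k)}
    return {name: set(toks) & top_terms for name, toks in tokens.items()}


def dt_tree_similarity(t1: dict, t2: dict) -> float:
    """Mean Dice coefficient over corresponding nodes.

    Averaging is an assumed aggregation, not taken from the paper."""
    nodes = set(t1) | set(t2)
    if not nodes:
        return 0.0
    return sum(dice(t1.get(n, set()), t2.get(n, set())) for n in nodes) / len(nodes)


def dt_tree_distance(t1: dict, t2: dict) -> float:
    """Distance derived from similarity as 1 - similarity (assumed)."""
    return 1.0 - dt_tree_similarity(t1, t2)


# Hypothetical toy documents, only to exercise the functions.
tree_a = build_dt_tree({"title": "protein folding",
                        "abstract": "protein folding in cells"})
tree_b = build_dt_tree({"title": "protein structure",
                        "abstract": "quantum field theory"})
print(dt_tree_distance(tree_a, tree_b))
```

A distance of this kind could then be passed to a clustering step such as k-means, which appears in the record's keywords; how the paper performs that step is not stated in the abstract.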