• DocumentCode
    1152432
  • Title
    Multilayer SOM With Tree-Structured Data for Efficient Document Retrieval and Plagiarism Detection
  • Author
    Chow, Tommy W. S.; Rahman, M.K.M.
  • Author_Institution
    Dept. of Electron. Eng., City Univ. of Hong Kong, Kowloon, China
  • Volume
    20
  • Issue
    9
  • fYear
    2009
  • Firstpage
    1385
  • Lastpage
    1402
  • Abstract
    This paper proposes a new document retrieval (DR) and plagiarism detection (PD) system using a multilayer self-organizing map (MLSOM). A document is modeled by a rich tree-structured representation, and a SOM-based system is used as a computationally effective solution. Instead of relying on keywords/lines, the proposed scheme compares a full document as a query for performing retrieval and PD. The tree-structured representation hierarchically includes document features at the document, page, and paragraph levels. Thus, it can reflect underlying context that is difficult to acquire from the currently used word-frequency information. We show that the tree-structured data is effective for DR and PD. To handle the tree-structured representation efficiently, we use an MLSOM algorithm, which was previously developed by the authors for image retrieval. In this study, it serves as an effective clustering algorithm. Using the MLSOM, local matching techniques are developed for comparing text documents. Two novel MLSOM-based PD methods are proposed. Detailed simulations are conducted, and the experimental results corroborate that the proposed approach is computationally efficient and accurate for DR and PD.
  • Keywords
    image retrieval; query processing; self-organising feature maps; tree data structures; MLSOM algorithm; clustering algorithm; document retrieval system; local matching techniques; multilayer self-organizing map; plagiarism detection system; simulation; tree-structured representation; word-frequency information; Document retrieval (DR); multilayer self-organizing map (MLSOM); plagiarism detection (PD); Algorithms; Artificial Intelligence; Cluster Analysis; Computer Simulation; Humans; Information Storage and Retrieval; Internet; Neural Networks (Computer); Neurons; Pattern Recognition, Automated; Plagiarism; Time Factors
  • fLanguage
    English
  • Journal_Title
    Neural Networks, IEEE Transactions on
  • Publisher
    IEEE
  • ISSN
    1045-9227
  • Type
    jour
  • DOI
    10.1109/TNN.2009.2023394
  • Filename
    5175377