DocumentCode :
2337018
Title :
Use of text syntactical structures in detection of document duplicates
Author :
Elhadi, Mohamed ; Al-Tobi, Amjad
Author_Institution :
Dept. of Comput. Sci., Sultan Qaboos Univ., Al-Khod
fYear :
2008
fDate :
13-16 Nov. 2008
Firstpage :
520
Lastpage :
525
Abstract :
This is the first paper reporting on a set of experiments that address the determination of text similarity using combined syntactical representation and string alignment techniques. The suggested approach takes advantage of document syntactical structure, as manifested in Part of Speech (POS) tags, and uses it as a basis for further processing. Documents, including the query, are pre-processed with a POS tagger that converts each of them into a reduced string capturing some of the author's writing style and some of the semantics of the written text. This provides a means of representing a document at a higher level of abstraction, one that captures the different alterations that can be made to documents that are similar in origin and possibly in style. This in turn enables such documents to be processed with many of the available string manipulation and matching algorithms. The work is inspired and driven by the parallel between text processing and sequence alignment in computational biology. Sequence alignment techniques are used to analyze and establish the utility of the strings produced by this syntactical representation of content.
Keywords :
query processing; string matching; text analysis; word processing; Part of Speech tags; document duplicate detection; document processing; matching algorithms; sequence alignment; string alignment; syntactical representation; text processing; text similarity; text syntactical structures; Bioinformatics; Computational biology; Computer science; Proteins; Sequences; Speech processing; Tagging; Text analysis; Text processing; Writing
fLanguage :
English
Publisher :
IEEE
Conference_Title :
Digital Information Management, 2008. ICDIM 2008. Third International Conference on
Conference_Location :
London
Print_ISBN :
978-1-4244-2916-5
Electronic_ISBN :
978-1-4244-2917-2
Type :
conf
DOI :
10.1109/ICDIM.2008.4746719
Filename :
4746719