Clustering of Short Strings in Large Databases

Author

Kazimianec, Michail ; Mazeika, Arturas

Author_Institution

Fac. of Comput. Sci., Free Univ. of Bozen-Bolzano, Bolzano, Italy

fYear

2009

fDate

Aug. 31 2009-Sept. 4 2009

Firstpage

368

Lastpage

372

Abstract

A novel method, CLOSS, is proposed for textual databases. It identifies clusters of misspelled strings even when the cluster border is not prominent. The method uses a q-gram approach to represent the data and a string proximity graph to find the cluster. The contribution addresses the clustering of short strings in text mining for cases where the proximity graph has multiple horizontal lines or no horizontal line at all.
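
As an illustration of the q-gram representation mentioned in the abstract, the following is a minimal Python sketch of how a short string can be split into overlapping q-grams and how two strings can be compared by the q-grams they share. It is not the CLOSS algorithm: the function names `qgrams` and `shared_qgrams`, the default q = 2, and the `#` padding character are all assumptions made for this example and are not taken from the paper.

```python
from collections import Counter


def qgrams(s, q=2, pad="#"):
    """Return the overlapping q-grams of s.

    Assumption: the string is padded with q-1 copies of `pad` on each
    side so that its first and last characters also form q-grams.
    """
    padded = pad * (q - 1) + s + pad * (q - 1)
    return [padded[i:i + q] for i in range(len(padded) - q + 1)]


def shared_qgrams(a, b, q=2):
    """Count the q-grams common to a and b (multiset intersection)."""
    common = Counter(qgrams(a, q)) & Counter(qgrams(b, q))
    return sum(common.values())


if __name__ == "__main__":
    print(qgrams("abc"))
    print(shared_qgrams("bolzano", "bolzamo"))
```

A misspelled variant typically shares most of its q-grams with the correct string, which is what makes a q-gram overlap count usable as a proximity measure for short strings.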

Keywords

data mining; pattern clustering; string matching; text analysis; very large databases; CLOSS; cluster border; clustering of short strings; large databases; q-gram approach; string proximity graph; text mining; textual databases; Application software; Clustering methods; Computer science; Databases; Detection algorithms; Expert systems; Robustness; Smoothing methods; Tagging; Text mining; clustering; q-grams; short strings

fLanguage

English

Publisher

IEEE

Conference_Titel

Database and Expert Systems Application, 2009. DEXA '09. 20th International Workshop on

Conference_Location

Linz

ISSN

1529-4188

Print_ISBN

978-0-7695-3763-4

Type

conf

DOI

10.1109/DEXA.2009.73

Filename

5337105

Link To Document

https://search.isc.ac/dl/search/defaultta.aspx?DTC=49&DC=2456961