DocumentCode :
168290
Title :
The anatomy of a search and mining system for digital humanities
Author :
Harris, M. ; Levene, M. ; Zhang, D. ; Levene, D.
Author_Institution :
Dept. of Comput. Sci., Univ. of London, London, UK
fYear :
2014
fDate :
8-12 Sept. 2014
Firstpage :
165
Lastpage :
168
Abstract :
Samtla (Search And Mining Tools with Linguistic Analysis) is an online integrated research environment designed in collaboration with historians and linguists to facilitate the study of digitised texts written in any language. It currently supports the research of two corpora: the Genizah collection held by the Taylor-Schechter Genizah Research Unit in Cambridge University, and a collection of Aramaic incantation texts from late antiquity. In contrast to standard search engines and text mining systems that rely on the bag-of-words representation of text, Samtla provides the retrieval and discovery of fuzzy text patterns/motifs (aka “formulae” to historians), which is achieved through applying a character-based n-gram statistical language model built on top of a powerful generalised suffix tree data structure. This paper briefly describes the major components of Samtla and their underlying techniques.
Keywords :
data mining; fuzzy set theory; linguistics; natural language processing; text analysis; tree data structures; Aramaic incantation text collection; Cambridge University; Genizah collection; Samtla; Taylor-Schechter Genizah Research Unit; character-based n-gram statistical language model; digital humanities; digitised texts; fuzzy text motif discovery; fuzzy text motif retrieval; fuzzy text pattern discovery; fuzzy text pattern retrieval; generalised suffix tree data structure; late antiquity; online integrated research environment; search and mining tool with linguistic analysis; Collaboration; Communities; Computational modeling; Data models; Educational institutions; Mathematical model; Text mining; Collaborative Search; Digital Humanities; Sequence Alignment; Statistical Language Model; Suffix Tree;
fLanguage :
English
Publisher :
IEEE
Conference_Title :
Digital Libraries (JCDL), 2014 IEEE/ACM Joint Conference on
Conference_Location :
London
Type :
conf
DOI :
10.1109/JCDL.2014.6970163
Filename :
6970163