• DocumentCode
    168290
  • Title
    The anatomy of a search and mining system for digital humanities
  • Author
    Harris, M. ; Levene, M. ; Zhang, D. ; Levene, D.
  • Author_Institution
    Dept. of Comput. Sci., Univ. of London, London, UK
  • fYear
    2014
  • fDate
    8-12 Sept. 2014
  • Firstpage
    165
  • Lastpage
    168
  • Abstract
    Samtla (Search And Mining Tools with Linguistic Analysis) is an online integrated research environment designed in collaboration with historians and linguists to facilitate the study of digitised texts written in any language. It currently supports the research of two corpora: the Genizah collection held by the Taylor-Schechter Genizah Research Unit in Cambridge University, and a collection of Aramaic incantation texts from late antiquity. In contrast to standard search engines and text mining systems that rely on the bag-of-words representation of text, Samtla provides the retrieval and discovery of fuzzy text patterns/motifs (aka “formulae” to historians), which is achieved through applying a character-based n-gram statistical language model built on top of a powerful generalised suffix tree data structure. This paper briefly describes the major components of Samtla and their underlying techniques.
  • Keywords
    data mining; fuzzy set theory; linguistics; natural language processing; text analysis; tree data structures; Aramaic incantation text collection; Cambridge University; Genizah collection; Samtla; Taylor-Schechter Genizah Research Unit; character-based n-gram statistical language model; digital humanities; digitised texts; fuzzy text motif discovery; fuzzy text motif retrieval; fuzzy text pattern discovery; fuzzy text pattern retrieval; generalised suffix tree data structure; late antiquity; online integrated research environment; search and mining tool with linguistic analysis; Collaboration; Communities; Computational modeling; Data models; Educational institutions; Mathematical model; Text mining; Collaborative Search; Digital Humanities; Sequence Alignment; Statistical Language Model; Suffix Tree;
  • fLanguage
    English
  • Publisher
    ieee
  • Conference_Titel
    Digital Libraries (JCDL), 2014 IEEE/ACM Joint Conference on
  • Conference_Location
    London
  • Type
    conf
  • DOI
    10.1109/JCDL.2014.6970163
  • Filename
    6970163