• DocumentCode
    2454390
  • Title
    Building a Biomedical Tokenizer Using the Token Lattice Design Pattern and the Adapted Viterbi Algorithm
  • Author
    Barrett, Neil ; Weber-Jahnke, Jens
  • Author_Institution
    Dept. of Comput. Sci., Univ. of Victoria, Victoria, BC, Canada
  • fYear
    2010
  • fDate
    12-14 Dec. 2010
  • Firstpage
    473
  • Lastpage
    478
  • Abstract
    Proper tokenization of biomedical text is a non-trivial problem. Current biomedical tokenizers suffer from idiosyncratic output and poor extensibility and reuse. To address these problems, we identified and completed a novel design pattern for biomedical tokenizers. The pattern separates a tokenizer into three components: a token lattice and lattice constructor, a best lattice-path chooser, and token transducers. Token transducers create tokens from text. The lattice constructor assembles these tokens into a token lattice. The best path through the lattice is then selected as the tokenization of the text. We applied our design pattern and our token transducer identification guidelines to create a tokenizer for SNOMED CT concept descriptions and compared it to three other tokenization methods. MedPost and our adapted Viterbi tokenizer performed best, with accuracies of 90.1% and 93.7%, respectively.
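    The three components named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the regular-expression transducers, the token labels, and the squared-length score are all assumptions chosen for illustration, and the path chooser is a plain Viterbi-style dynamic program rather than the paper's adapted Viterbi algorithm.

```python
import re
from typing import Callable, Dict, List, Optional, Tuple

# A token is (start, end, label); offsets are character positions in the text.
Token = Tuple[int, int, str]
Transducer = Callable[[str], List[Token]]


def regex_transducer(pattern: str, label: str) -> Transducer:
    """Hypothetical token transducer: emits one token per regex match."""
    compiled = re.compile(pattern)

    def run(text: str) -> List[Token]:
        return [(m.start(), m.end(), label)
                for m in compiled.finditer(text) if m.end() > m.start()]

    return run


# Assumed transducer set; the paper derives its own from its guidelines.
TRANSDUCERS: List[Transducer] = [
    regex_transducer(r"[A-Za-z]+", "WORD"),
    regex_transducer(r"\d+", "NUM"),
    regex_transducer(r"\s+", "SPACE"),
    regex_transducer(r"[^\w\s]", "PUNCT"),
    regex_transducer(r"[A-Za-z0-9]+(?:-[A-Za-z0-9]+)+", "COMPOUND"),
]


def build_lattice(text: str, transducers: List[Transducer]) -> Dict[int, List[Token]]:
    """Lattice constructor: index every proposed token by its start offset."""
    lattice: Dict[int, List[Token]] = {}
    for transducer in transducers:
        for token in transducer(text):
            lattice.setdefault(token[0], []).append(token)
    return lattice


def score(token: Token) -> float:
    """Assumed score that favours longer tokens (squared length)."""
    start, end, _ = token
    return float((end - start) ** 2)


def best_path(text: str, lattice: Dict[int, List[Token]]) -> List[Token]:
    """Best lattice-path chooser: highest-scoring path covering the text."""
    n = len(text)
    # best[i] holds (score, path) for the best tokenization of text[:i].
    best: List[Optional[Tuple[float, List[Token]]]] = [None] * (n + 1)
    best[0] = (0.0, [])
    for i in range(n):
        if best[i] is None:
            continue  # no path reaches this offset
        base_score, base_path = best[i]
        for token in lattice.get(i, []):
            end = token[1]
            candidate = base_score + score(token)
            if best[end] is None or candidate > best[end][0]:
                best[end] = (candidate, base_path + [token])
    if best[n] is None:
        raise ValueError("no lattice path covers the whole text")
    return best[n][1]


def tokenize(text: str) -> List[str]:
    """Run all three components and drop whitespace tokens from the result."""
    path = best_path(text, build_lattice(text, TRANSDUCERS))
    return [text[s:e] for s, e, label in path if label != "SPACE"]


print(tokenize("IL-2 receptor"))  # ['IL-2', 'receptor']
```

    In this sketch the lattice for "IL-2 receptor" contains both the single token "IL-2" and the split "IL", "-", "2"; the assumed score makes the path chooser prefer the former. Adding a transducer to the list extends the tokenizer without touching the lattice constructor or the path chooser, which is the kind of extensibility the abstract describes.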
  • Keywords
    medicine; natural language processing; text analysis; adapted Viterbi algorithm; biomedical text; biomedical tokenizer; idiosyncratic tokenizer; lattice constructor; lattice-path chooser; token lattice design pattern; token transducer identification guidelines; tokenization; tokenizer design pattern; tokenizer extensibility; Guidelines; Helium; Lattices; Software; Tagging; Transducers; Viterbi algorithm; Machine Learning; Medicine and Science; Natural Language Processing; Patterns;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Machine Learning and Applications (ICMLA), 2010 Ninth International Conference on
  • Conference_Location
    Washington, DC
  • Print_ISBN
    978-1-4244-9211-4
  • Type
    conf
  • DOI
    10.1109/ICMLA.2010.76
  • Filename
    5708873