Title :
Building a Biomedical Tokenizer Using the Token Lattice Design Pattern and the Adapted Viterbi Algorithm
Author :
Barrett, Neil ; Weber-Jahnke, Jens
Author_Institution :
Dept. of Comput. Sci., Univ. of Victoria, Victoria, BC, Canada
Abstract :
Proper tokenization of biomedical text is a non-trivial problem. Current biomedical tokenizers suffer from idiosyncratic output and from poor extensibility and reuse. To address these problems, we identified and completed a novel design pattern for biomedical tokenizers. The pattern separates a tokenizer into three components: a token lattice with its lattice constructor, a best lattice-path chooser, and token transducers. Token transducers create tokens from text. The lattice constructor assembles these tokens into a token lattice. The best path through the lattice is then selected, and that path is the tokenization of the text. We applied our design pattern and our token transducer identification guidelines to create a tokenizer for SNOMED CT concept descriptions, and compared it to three other tokenization methods. MedPost and our adapted Viterbi tokenizer performed best, with accuracies of 90.1% and 93.7% respectively.
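The three components named in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the regular-expression transducers, the fixed per-token scores, and the names `Token`, `regex_transducer`, `build_lattice`, `best_path`, and `tokenize` are all hypothetical, and the path chooser is a generic Viterbi-style dynamic program over the lattice rather than the adapted Viterbi algorithm of the paper.

```python
import re
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass(frozen=True)
class Token:
    start: int    # inclusive character offset
    end: int      # exclusive character offset
    text: str
    score: float  # hypothetical log-score; higher is better


Transducer = Callable[[str], List[Token]]


def regex_transducer(pattern: str, score: float) -> Transducer:
    """Token transducer: proposes every non-empty regex match as a token."""
    compiled = re.compile(pattern)

    def transduce(text: str) -> List[Token]:
        return [Token(m.start(), m.end(), m.group(), score)
                for m in compiled.finditer(text) if m.end() > m.start()]

    return transduce


def build_lattice(text: str, transducers: List[Transducer]) -> Dict[int, List[Token]]:
    """Lattice constructor: index all proposed tokens by start offset."""
    lattice: Dict[int, List[Token]] = {}
    for transduce in transducers:
        for token in transduce(text):
            lattice.setdefault(token.start, []).append(token)
    return lattice


def best_path(text: str, lattice: Dict[int, List[Token]]) -> List[Token]:
    """Best-path chooser: highest-scoring token sequence covering the text."""
    n = len(text)
    best = [float("-inf")] * (n + 1)   # best score of any path ending at offset i
    back: List[Optional[Token]] = [None] * (n + 1)
    best[0] = 0.0
    for pos in range(n):
        if best[pos] == float("-inf"):
            continue  # offset not reachable by any token sequence
        for token in lattice.get(pos, []):
            candidate = best[pos] + token.score
            if candidate > best[token.end]:
                best[token.end] = candidate
                back[token.end] = token
    if best[n] == float("-inf"):
        raise ValueError("no path through the lattice covers the whole text")
    path: List[Token] = []
    pos = n
    while pos > 0:
        token = back[pos]
        path.append(token)
        pos = token.start
    return path[::-1]


# Hypothetical transducers and scores, chosen only for illustration.
TRANSDUCERS = [
    regex_transducer(r"[A-Za-z]+", -1.0),       # plain words
    regex_transducer(r"\d+", -1.0),             # numbers
    regex_transducer(r"\d+-[A-Za-z]+", -1.5),   # e.g. "2-methylpropane"
    regex_transducer(r"\s+", 0.0),              # whitespace, dropped from output
    regex_transducer(r"\S", -3.0),              # single-character fallback
]


def tokenize(text: str) -> List[str]:
    lattice = build_lattice(text, TRANSDUCERS)
    return [t.text for t in best_path(text, lattice) if not t.text.isspace()]


print(tokenize("2-methylpropane 5 mg"))  # ['2-methylpropane', '5', 'mg']
```

In this sketch, adding support for a new token type means adding one transducer to the list; the lattice constructor and the path chooser stay unchanged, which is the kind of extensibility and reuse the design pattern aims for.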
Keywords :
medicine; natural language processing; text analysis; adapted Viterbi algorithm; biomedical text; biomedical tokenizer; idiosyncratic tokenizer; lattice constructor; lattice-path chooser; token lattice design pattern; token transducer identification guidelines; tokenization; tokenizer design pattern; tokenizer extensibility; Guidelines; Helium; Lattices; Software; Tagging; Transducers; Viterbi algorithm; Machine Learning; Medicine and Science; Natural Language Processing; Patterns;
Conference_Title :
Machine Learning and Applications (ICMLA), 2010 Ninth International Conference on
Conference_Location :
Washington, DC
Print_ISBN :
978-1-4244-9211-4
DOI :
10.1109/ICMLA.2010.76