• DocumentCode
    710005
  • Title
    Character gazetteer for Named Entity Recognition with linear matching complexity
  • Author
    Dlugolinsky, Stefan ; Giang Nguyen ; Laclavik, Michal ; Seleng, Martin
  • Author_Institution
    Inst. of Inf., Bratislava, Slovakia
  • fYear
    2013
  • fDate
    15-18 Dec. 2013
  • Firstpage
    361
  • Lastpage
    365
  • Abstract
    A large amount of unstructured data is produced daily through the numerous media around us. Although computer systems, even commodity hardware, are becoming more powerful, processing such data and extracting useful information in a time-efficient manner remains a problem. One domain of unstructured data processing is Natural Language Processing (NLP), which covers areas such as information extraction, machine translation, word sense disambiguation, and automated question answering. All of these areas require fast and precise Named Entity Recognition (NER), which is not a trivial task because of the size and heterogeneity of the processed data. Our effort in this research area is to provide fast tokenization and precise NER with linear complexity. In this paper, we present a character gazetteer with linear tokenization and NER, and compare two tree data structure representations of it: a multiway tree implemented by hash maps and a first child-next sibling binary tree. Our measurements show that one outperforms the other in processing time, while the other is more efficient in memory consumption.
  • Keywords
    computational complexity; natural language processing; pattern matching; tree data structures; NER; NLP; automated question answering; character gazetteer; computer systems; first child-next sibling binary tree; hash maps; information extraction; linear matching complexity; linear tokenization; machine translation; memory consumption efficiency; multiway tree; named entity recognition; natural language processing; tree data structure representations; unstructured data; unstructured data processing; word sense disambiguation; Complexity theory; Data mining; Electronic publishing; Encyclopedias; Internet; Logic gates; gazetteer; named entity recognition; natural language processing; text processing; tokenization
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Information and Communication Technologies (WICT), 2013 Third World Congress on
  • Conference_Location
    Hanoi
  • Type
    conf
  • DOI
    10.1109/WICT.2013.7113096
  • Filename
    7113096
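
The two tree representations named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the class names `HashNode` and `SiblingNode`, the helpers `insert`, `longest_match` and `find_entities`, and the entity labels are all hypothetical, chosen only for illustration. The sketch also assumes a simple longest-match scan that restarts at every text position, so its cost is bounded by the text length times the longest gazetteer entry; it does not reproduce the linear tokenization and matching the paper reports, and it ignores word boundaries.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Optional


@dataclass
class HashNode:
    """Multiway tree node: children are found by a hash-map lookup."""
    children: dict[str, HashNode] = field(default_factory=dict)
    label: Optional[str] = None  # set when a gazetteer entry ends here

    def child(self, ch: str) -> Optional[HashNode]:
        return self.children.get(ch)

    def add_child(self, ch: str) -> HashNode:
        return self.children.setdefault(ch, HashNode())


@dataclass
class SiblingNode:
    """First child-next sibling node: children form a linked list."""
    ch: str = ""
    first_child: Optional[SiblingNode] = None
    next_sibling: Optional[SiblingNode] = None
    label: Optional[str] = None

    def child(self, ch: str) -> Optional[SiblingNode]:
        node = self.first_child
        while node is not None and node.ch != ch:
            node = node.next_sibling  # walk the sibling chain
        return node

    def add_child(self, ch: str) -> SiblingNode:
        found = self.child(ch)
        if found is None:
            # Prepend the new child to the sibling chain.
            found = SiblingNode(ch=ch, next_sibling=self.first_child)
            self.first_child = found
        return found


def insert(root, phrase: str, label: str) -> None:
    """Add one gazetteer entry, character by character."""
    node = root
    for ch in phrase:
        node = node.add_child(ch)
    node.label = label


def longest_match(root, text: str, start: int):
    """Return (end, label) of the longest entry starting at `start`, or None."""
    node, best = root, None
    for i in range(start, len(text)):
        node = node.child(text[i])
        if node is None:
            break
        if node.label is not None:
            best = (i + 1, node.label)
    return best


def find_entities(root, text: str) -> list[tuple[str, str]]:
    """Scan `text` left to right, keeping the longest match at each position."""
    found, i = [], 0
    while i < len(text):
        match = longest_match(root, text, i)
        if match is None:
            i += 1
        else:
            end, label = match
            found.append((text[i:end], label))
            i = end
    return found
```

Both node types expose the same `child` and `add_child` methods, so `find_entities` works unchanged with either root. For example, after `insert(root, "Bratislava", "LOC")` and `insert(root, "Hanoi", "LOC")`, calling `find_entities(root, "From Bratislava to Hanoi")` returns `[("Bratislava", "LOC"), ("Hanoi", "LOC")]` for a `HashNode` root and for a `SiblingNode` root alike. The hash-map node spends memory on one dictionary per node to get constant-time child lookup, while the sibling-list node stores only two references per node but must walk the chain, which is one plausible reading of the time versus memory trade-off the abstract describes.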