Title :
A high-speed and large-scale dictionary matching engine for Information Extraction systems
Author :
Agarwal, K.; Polig, Raphael
Author_Institution :
Austin Res. Lab., IBM Corp., Austin, TX, USA
Abstract :
Dictionary matching is a commonly used operation in Information Extraction (IE) systems. It involves matching a set of strings in a document against a dictionary of pre-defined patterns. In this paper, we describe a high-performance, scalable hardware architecture that enables high-throughput dictionary matching on very large dictionaries for text analytics applications. Our hardware accelerator employs a novel hashing-based approach instead of the commonly used deterministic finite automata (DFA) based algorithms. A limitation of DFA-based approaches is that they typically process one character every cycle, whereas the proposed hash-based scheme can process a string token every cycle, thus achieving significantly higher processing throughput than DFA-based implementations. Our measurement results, based on a prototype implementation on an Altera Stratix IV FPGA device, indicate that our hardware dictionary matching engine can process typical document streams at a rate of ~1.5 GB/s (~12 Gbps) while supporting large dictionaries containing up to ~100K patterns, making it very useful for IE workload acceleration.
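The token-per-lookup idea in the abstract can be illustrated with a minimal software sketch. This is not the authors' hardware design: it assumes single-token patterns, uses a Python set as a stand-in for the hardware hash tables, and all function names (`build_dictionary`, `tokenize`, `match_tokens`, `gbytes_to_gbits`) are hypothetical. It only shows why matching costs one lookup per token rather than one state transition per character, and checks the GB/s to Gbps conversion quoted above.

```python
import re


def build_dictionary(patterns):
    """Store patterns in a hash set (assumption: single-token, case-insensitive)."""
    return {p.lower() for p in patterns}


def tokenize(text):
    """Split text into word tokens; a stand-in for the engine's tokenizer."""
    return re.findall(r"\w+", text.lower())


def match_tokens(text, dictionary):
    """Return (token_index, token) for every token found in the dictionary.

    Each token costs one hash lookup, regardless of its length in characters.
    """
    return [(i, tok) for i, tok in enumerate(tokenize(text)) if tok in dictionary]


def gbytes_to_gbits(gbytes_per_second):
    """Convert a throughput in GB/s to Gbps (8 bits per byte)."""
    return gbytes_per_second * 8


if __name__ == "__main__":
    dictionary = build_dictionary(["FPGA", "Austin"])
    text = "IBM builds FPGA accelerators in Austin"
    print(match_tokens(text, dictionary))
    print(len(tokenize(text)), "lookups for", len(text), "characters")
    print(gbytes_to_gbits(1.5), "Gbps")
```

In this example the six tokens need six lookups, whereas a character-at-a-time automaton would step through every character of the input. Multi-token patterns and hash collision handling, which a real engine must address, are deliberately left out.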
Keywords :
dictionaries; field programmable gate arrays; file organisation; information retrieval systems; string matching; text analysis; Altera Stratix IV FPGA device; DFA based algorithms; IE systems; IE workload acceleration; deterministic finite automata based algorithms; hardware accelerator; hardware dictionary matching engine; hashing based approach; high throughput dictionary matching; high-speed dictionary matching engine; information extraction system; large-scale dictionary matching engine; scalable hardware architecture; string matching; string token; text analytics applications; Arrays; Dictionaries; Field programmable gate arrays; Hardware; Pattern matching; Random access memory; Throughput; FPGA; dictionary matching; hardware acceleration; hashing; information extraction; pattern matching; string matching; text analytics;
Conference_Title :
Application-Specific Systems, Architectures and Processors (ASAP), 2013 IEEE 24th International Conference on
Conference_Location :
Washington, DC
Print_ISBN :
978-1-4799-0494-5
DOI :
10.1109/ASAP.2013.6567551