Title :
Identification of Investigator Name Zones Using SVM Classifiers and Heuristic Rules
Author :
Jongwoo Kim ; Le, Daniel X. ; Thoma, George R.
Author_Institution :
Nat. Libr. of Med., Bethesda, MD, USA
Abstract :
The research reported in biomedical articles often involves large numbers of investigators at different institutions. To properly credit these investigators, an article's authors frequently name them together in some part of the article. These Investigator Names (IN) now constitute a required field in the MEDLINE® citation for the article. The automated extraction of these names is implemented in a system developed by a research group at the U.S. National Library of Medicine, consisting of three modules based on Support Vector Machine (SVM) classifiers and heuristic rules. The SVM classifiers label text blocks ("zones") that possibly contain Investigator Names, and the heuristic rules identify the actual zones. We collect eleven sets of word lists to train and test the classifiers, each set containing 100 to 56,000 words. Experimental results on online biomedical articles show a Precision of 0.90, a Recall of 0.95, an F-Measure of 0.92, and an Accuracy of 0.99.
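As a quick consistency check on the reported figures, the short Python sketch below recomputes the F-Measure from the stated Precision and Recall. It assumes the F-Measure is the balanced F1 score (the harmonic mean of Precision and Recall), which the abstract does not state explicitly; the function name `f_measure` and its `beta` parameter are illustrative and not taken from the paper.

```python
def f_measure(precision: float, recall: float, beta: float = 1.0) -> float:
    """Weighted harmonic mean of precision and recall (F1 when beta == 1).

    Assumption: the abstract's "F-Measure" is the balanced F1 score.
    """
    if precision == 0.0 and recall == 0.0:
        return 0.0  # avoid division by zero when both inputs are zero
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Values reported in the abstract
reported_precision = 0.90
reported_recall = 0.95

f1 = f_measure(reported_precision, reported_recall)
print(round(f1, 2))  # 0.92
```

Under that assumption, the recomputed value rounds to 0.92, which agrees with the F-Measure given in the abstract.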
Keywords :
bioinformatics; citation analysis; pattern classification; support vector machines; text analysis; MEDLINE citation; SVM classifier label text blocks; SVM classifiers; US National Library of Medicine; article authors; automated extraction; classifier testing; classifier training; heuristic rules; investigator name zone identification; online biomedical articles; support vector machine classifiers; word lists; Accuracy; Classification algorithms; Data mining; Labeling; Libraries; Merging; Investigator Names; MEDLINE; Support Vector Machine; bibliographic information
Conference_Title :
Document Analysis and Recognition (ICDAR), 2013 12th International Conference on
Conference_Location :
Washington, DC
DOI :
10.1109/ICDAR.2013.35