Title :
Information extraction from scientific paper using rhetorical classifier
Author :
Khodra, Masayu Leylia ; Widyantoro, Dwi H. ; Aziz, E.A. ; Bambang, Riyanto Trilaksono
Author_Institution :
Sch. of Electr. Eng., Bandung Inst. of Technol., Bandung, Indonesia
Abstract :
Time constraints often lead readers of a scientific paper to read only its title and abstract, but reading these parts alone is often ineffective. This study aims to extract information automatically to help readers get structured information from a scientific paper. Information extraction is done by rhetorical classification of each sentence in the paper. Rhetorical information is the intention that the author of the paper wants to convey to the reader. This research used a corpus-based approach to build the rhetorical classifier. Since rhetorical corpora were lacking, we constructed our own corpus, a collection of sentences labeled with rhetorical information. Each sentence is represented as a vector of content, location, citation, and meta-discourse features. This collection of feature vectors is used to build rhetorical classifiers with machine learning techniques. Experiments were conducted to select the best learning technique for the rhetorical classifier. The training set consists of 7239 labeled sentences, and the testing set consists of 3638 labeled sentences. We used the WEKA (Waikato Environment for Knowledge Analysis) and LibSVM libraries. The learning techniques considered were Naive Bayes, C4.5, Logistic, Multi-Layer Perceptron, PART, Instance-based Learning, and Support Vector Machines (SVM). The best performers are the SVM and Logistic classifiers, with an accuracy of 0.51. By applying a one-against-all strategy, the SVM accuracy can be improved to 0.60.
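The one-against-all strategy mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: a plain perceptron stands in for the SVM that the paper trained with LibSVM, and the three-feature vectors (location, has_citation, has_meta_discourse), the labels "aim", "background", and "result", and all function names are hypothetical choices made only for illustration. One binary classifier is trained per label against all other labels, and a sentence receives the label whose classifier scores it highest.

```python
# Toy one-against-all classifier. Data, labels, and names are illustrative
# assumptions; a perceptron replaces the SVM used in the paper.

def train_perceptron(rows, targets, epochs=100):
    """Train a binary perceptron; targets are +1 or -1. Returns (weights, bias)."""
    weights = [0.0] * len(rows[0])
    bias = 0.0
    for _ in range(epochs):
        mistakes = 0
        for x, y in zip(rows, targets):
            score = sum(w * v for w, v in zip(weights, x)) + bias
            if y * score <= 0:  # misclassified (or on the boundary): update
                weights = [w + y * v for w, v in zip(weights, x)]
                bias += y
                mistakes += 1
        if mistakes == 0:  # every training row is on the correct side
            break
    return weights, bias


def train_one_vs_all(rows, labels):
    """Train one binary model per label: that label (+1) against the rest (-1)."""
    models = {}
    for label in sorted(set(labels)):
        targets = [1 if current == label else -1 for current in labels]
        models[label] = train_perceptron(rows, targets)
    return models


def predict(models, x):
    """Return the label whose binary model gives x the highest score."""
    def score(label):
        weights, bias = models[label]
        return sum(w * v for w, v in zip(weights, x)) + bias
    return max(models, key=score)


# Hypothetical feature vectors: [location, has_citation, has_meta_discourse]
ROWS = [
    [0.10, 1, 0], [0.20, 1, 0],   # sentences citing other work
    [0.15, 0, 1], [0.30, 0, 1],   # sentences with a meta-discourse cue
    [0.80, 0, 0], [0.90, 0, 0],   # late sentences with neither cue
]
LABELS = ["background", "background", "aim", "aim", "result", "result"]

MODELS = train_one_vs_all(ROWS, LABELS)
print([predict(MODELS, x) for x in ROWS])
```

The toy data is chosen so that each label is linearly separable from the others, which lets every binary perceptron converge. Real rhetorical features are unlikely to be this clean, which is consistent with the modest accuracies reported in the abstract.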
Keywords :
Bayes methods; information retrieval; learning (artificial intelligence); multilayer perceptrons; pattern classification; support vector machines; C4.5 learning technique; LibSVM libraries; PART; Waikato environment for knowledge analysis; corpus-based approach; feature vectors; information extraction; instance-based learning; logistic learning technique; machine learning techniques; multilayer perceptron; naive Bayes; rhetoric information; rhetorical classifier; scientific paper; sentence rhetorical classification; Accuracy; Data mining; Feature extraction; Logistics; Machine learning; Training; SVM classifier; rhetorical corpus
Conference_Title :
Electrical Engineering and Informatics (ICEEI), 2011 International Conference on
Conference_Location :
Bandung
Print_ISBN :
978-1-4577-0753-7
DOI :
10.1109/ICEEI.2011.6021634