TMAC: An automated text mining tool for construction of an annotated corpus to support protein-protein interaction information extraction

Author

Azzem, Rania Ahmed Abdel ; Seoud, Abul

Author_Institution

Dept. of Electr. Eng., El Fayoum Univ., Fayoum, Egypt

fYear

2010

fDate

2-4 Nov. 2010

Firstpage

75

Lastpage

79

Abstract

Extracting protein-protein interactions (PPIs) from the biomedical literature is an important topic in protein science. Annotated corpora are essential to the development and evaluation of PPI extraction systems, so a text mining tool is needed that can annotate any corpus with protein names and interaction events in order to identify interactions among proteins. In this paper we present a Java package called the TMAC system. TMAC tags protein names and interaction events in biomedical literature using a combination of carefully designed rules and a dictionary of protein names. TMAC normalizes the protein mentions and interaction events it finds by supplying the appropriate database reference. TMAC is divided into two modules. The first is the named entity identification and normalization module. The second is the interaction event tagger, which identifies words that indicate the occurrence of an interaction. TMAC achieved an average of 85.2% precision and 76.7% recall for protein identification, and an average of 88.2% precision and 71.8% recall for protein-protein interaction event identification. TMAC is a flexible system: it can be used as a standalone application or incorporated into the workflow of a more general text mining system.

Keywords

biology computing; data mining; proteins; text analysis; Java package; TMAC system; annotated corpora; annotated corpus; automated text mining; biomedical literatures; protein identification; protein science; protein-protein interaction extraction systems; protein-protein interaction information extraction; text mining system; Abstracts; Databases; Dictionaries; Protein engineering; Proteins; Text mining; named entity recognition; protein normalization; text-mining

fLanguage

English

Publisher

IEEE

Conference_Titel

Computer Technology and Development (ICCTD), 2010 2nd International Conference on

Conference_Location

Cairo

Print_ISBN

978-1-4244-8844-5

Electronic_ISBN

978-1-4244-8845-2

Type

conf

DOI

10.1109/ICCTD.2010.5646069

Filename

5646069