DocumentCode :
2506572
Title :
Querying text databases for efficient information extraction
Author :
Agichtein, Eugene ; Gravano, Luis
Author_Institution :
Columbia Univ., USA
fYear :
2003
fDate :
5-8 March 2003
Firstpage :
113
Lastpage :
124
Abstract :
A wealth of information is hidden within unstructured text. This information is often best exploited in structured or relational form, which is suited for sophisticated query processing, for integration with relational databases, and for data mining. Current information extraction techniques extract relations from a text database by examining every document in the database, or use filters to select promising documents for extraction. The exhaustive scanning approach is not practical or even feasible for large databases, and the current filtering techniques require human involvement to maintain and to adapt to new databases and domains. We develop an automatic query-based technique to retrieve documents useful for the extraction of user-defined relations from large text databases, which can be adapted to new domains, databases, or target relations with minimal human effort. We report a thorough experimental evaluation over a large newspaper archive that shows that we significantly improve the efficiency of the extraction process by focusing only on promising documents.
Keywords :
data mining; full-text databases; information filters; query processing; relational databases; text analysis; very large databases; automatic query-based technique; document retrieval; exhaustive scanning approach; information extraction techniques; newspaper archive; query text databases; unstructured text; user-defined relations; Corporate acquisitions; Government; Humans; Information filtering; Information retrieval; Monitoring
fLanguage :
English
Publisher :
IEEE
Conference_Title :
Data Engineering, 2003. Proceedings. 19th International Conference on
Print_ISBN :
0-7803-7665-X
Type :
conf
DOI :
10.1109/ICDE.2003.1260786
Filename :
1260786