Title :
Optimizing SQL Queries over Text Databases
Author :
Jain, Alpa ; Doan, AnHai ; Gravano, Luis
Author_Institution :
Dept. of Comput. Sci., Columbia Univ., New York, NY
Abstract :
Text documents often embed data that is structured in nature, and we can expose this structured data using information extraction technology. By processing a text database with information extraction systems, we can materialize a variety of structured "relations," over which we can then issue regular SQL queries. A key challenge to process SQL queries in this text-based scenario is efficiency: information extraction is time-consuming, so query processing strategies should minimize the number of documents that they process. Another key challenge is result quality: in the traditional relational world, all correct execution strategies for a SQL query produce the same (correct) result; in contrast, a SQL query execution over a text database might produce answers that are not fully accurate or complete, for a number of reasons. To address these challenges, we study a family of select-project-join SQL queries over text databases, and characterize query processing strategies on their efficiency and - critically - on their result quality as well. We optimize the execution of SQL queries over text databases in a principled, cost-based manner, incorporating this tradeoff between efficiency and result quality in a user-specific fashion. Our large-scale experiments- over real data sets and multiple information extraction systems - show that our SQL query processing approach consistently picks appropriate execution strategies for the desired balance between efficiency and result quality.
Keywords :
SQL; data structures; query processing; text analysis; information extraction technology; query processing strategy; select-project-join SQL query optimization; text database processing; text document; Cities and towns; Computer science; Data mining; Diseases; Embedded computing; Large-scale systems; Materials science and technology; Query processing; Relational databases; Soil;
Conference_Titel :
Data Engineering, 2008. ICDE 2008. IEEE 24th International Conference on
Conference_Location :
Cancun
Print_ISBN :
978-1-4244-1836-7
Electronic_ISBN :
978-1-4244-1837-4
DOI :
10.1109/ICDE.2008.4497472