DocumentCode :
2158559
Title :
Precision and recall of GlOSS estimators for database discovery
Author :
Gravano, Luis ; García-Molina, Héctor ; Tomasic, Anthony
Author_Institution :
Dept. of Comput. Sci., Stanford Univ., CA, USA
fYear :
1994
fDate :
28-30 Sep 1994
Firstpage :
103
Lastpage :
106
Abstract :
Online information vendors and the Internet together offer thousands of text databases from which a user may choose for a given information need. This paper presents a framework for and analyses a solution to this problem, which we call the text-database discovery problem. Our solution is to build a service that can suggest potentially good databases to search. A user's query goes through two steps: first, the query is presented to the GlOSS server (Glossary-Of-Servers Server) to select a set of promising databases to search. Secondly, the query is actually evaluated in the chosen databases. GlOSS gives a hint of what databases might be useful for the user's query, based on word-frequency information for each database. This information indicates how many documents in each database actually contain a keyword, for each field designator. To evaluate the set of databases that GlOSS returns for a given query, we present a framework based on the precision and recall metrics of information retrieval theory. We define metrics for the text-database discovery problem. We further extend our framework by offering different definitions for a “relevant database”. We have performed experiments using query traces from the FOLIO library information retrieval system, involving six databases available through FOLIO. The results obtained for different variants of GlOSS are very promising. Even though GlOSS keeps a small amount of information about the contents of the available databases, this information proved to be sufficient to produce very useful hints on where to search.
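Illustration :
The two ideas in the abstract, choosing databases from per-database word-frequency counts and scoring the chosen set with precision and recall, can be sketched roughly as follows. This is a minimal sketch and not the paper's own definitions: it assumes a conjunctive query, assumes that (field, word) pairs occur independently within a database, and uses the plain set-based forms of precision and recall. All database names, counts, and the "relevant" set are hypothetical.

```python
from math import prod


def estimate_matches(db_size, doc_freqs, query_terms):
    """Estimate how many documents in one database match every query term.

    Assumption (not taken from the abstract): terms are independent, so the
    estimate is db_size * product(df(term) / db_size).
    """
    if db_size == 0:
        return 0.0
    return db_size * prod(doc_freqs.get(term, 0) / db_size for term in query_terms)


def select_databases(summaries, query_terms):
    """Return the names of databases whose estimated match count is positive."""
    return {
        name
        for name, (db_size, doc_freqs) in summaries.items()
        if estimate_matches(db_size, doc_freqs, query_terms) > 0
    }


def precision_recall(chosen, relevant):
    """Set-based precision and recall of the chosen databases."""
    hits = len(chosen & relevant)
    precision = hits / len(chosen) if chosen else 1.0
    recall = hits / len(relevant) if relevant else 1.0
    return precision, recall


# Hypothetical summaries: database size and document frequency per (field, word).
summaries = {
    "db_a": (1000, {("author", "knuth"): 10, ("title", "computer"): 100}),
    "db_b": (500, {("author", "knuth"): 0, ("title", "computer"): 50}),
    "db_c": (200, {("author", "knuth"): 4, ("title", "computer"): 20}),
}
query = [("author", "knuth"), ("title", "computer")]

chosen = select_databases(summaries, query)
# Hypothetical ground truth: only db_a really holds matching documents.
relevant = {"db_a"}
precision, recall = precision_recall(chosen, relevant)
print(sorted(chosen), precision, recall)
```

In this toy run db_b is skipped because one query term never occurs in it, while db_c is suggested even though it turns out not to be relevant, which lowers precision without affecting recall.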
Keywords :
file servers; information retrieval; online front-ends; FOLIO library information retrieval system; GlOSS estimators; Glossary-Of-Servers Server; database selection; field designator; information need; information retrieval theory; keyword; precision metric; query evaluation; recall metric; relevant database; search hints; text-database discovery problem; word-frequency information; Computer science; Data mining; Distributed databases; Frequency; Information retrieval; Internet; Libraries; Terminology;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Parallel and Distributed Information Systems, 1994., Proceedings of the Third International Conference on
Conference_Location :
Austin, TX
Print_ISBN :
0-8186-6400-2
Type :
conf
DOI :
10.1109/PDIS.1994.331726
Filename :
331726