Abstract :
As a popular search mechanism, keyword search has been applied to retrieve useful data in documents, texts, graphs, and even relational databases. However, so far, there is no work on keyword search over uncertain graph data even though the uncertain graphs have been widely used in many real applications, such as modeling road networks, influential detection in social networks, and data analysis on PPI networks. Therefore, in this paper, we study the problem of top-k keyword search over uncertain graph data. Following the similar answer definition for keyword search over deterministic graphs, we consider a subtree in the uncertain graph as an answer to a keyword query if 1) it contains all the keywords; 2) it has a high score (defined by users or applications) based on keyword matching; and 3) it has low uncertainty. Keyword search over deterministic graphs is already a hard problem as stated in [1], [2], [3]. Due to the existence of uncertainty, keyword search over uncertain graphs is much harder. Therefore, to improve the search efficiency, we employ a filtering-and-verification strategy based on a probabilistic keyword index, PKIndex. For each keyword, we offline compute path-based top-k probabilities, and attach these values to PKIndex in an optimal, compressed way. In the filtering phase, we perform existence, path-based and tree-based probabilistic pruning phases, which filter out most false subtrees. In the verification, we propose a sampling algorithm to verify the candidates. Extensive experimental results demonstrate the effectiveness of the proposed algorithms.
Keywords :
data analysis; indexing; information filtering; probability; query processing; trees (mathematics); PKIndex; PPI network; data analysis; data retrieval; deterministic graph; documents; efficient keyword search; false subtree filtering; filtering-and-verification strategy; influential detection; keyword matching; keyword query; path-based probabilistic pruning; path-based top-k probability; probabilistic keyword index; relational database; road network modeling; sampling algorithm; search efficiency; search mechanism; social network; texts; top-k keyword search; tree-based probabilistic pruning; uncertain graph data; uncertain graph subtree; Data mining; Graphs; Information retrieval; Ontologies; Search methods; Text mining; Uncertainty; Database; algorithm; graph data; uncertain data;