  • DocumentCode
    2770977
  • Title
    Probabilistic Similarity Query on Dimension Incomplete Data
  • Author
    Cheng, Wei; Jin, Xiaoming; Sun, Jian-Tao
  • Author_Institution
    Sch. of Software, Tsinghua Univ., Beijing, China
  • fYear
    2009
  • fDate
    6-9 Dec. 2009
  • Firstpage
    81
  • Lastpage
    90
  • Abstract
    Retrieving similar data has attracted considerable research effort because of its importance in data mining, databases, and information retrieval. The problem is challenging when the data is incomplete. In previous research, data incompleteness means that the data values for some dimensions are unknown. However, in many practical applications (e.g., data collection by a sensor network in a harsh environment), not only the data values but also the dimension information may be missing, which makes most similarity query algorithms infeasible. In this work, we propose the novel problem of similarity query on dimension incomplete data and adopt a probabilistic framework to model it. Users specify their retrieval requirements with a distance threshold and a probability threshold. The distance threshold bounds the allowed distance between the query and a data object, and the probability threshold requires that each retrieved result satisfy the distance condition with at least the given probability. Instead of enumerating all possible cases to recover the missing dimensions, we propose an efficient approach that speeds up retrieval by leveraging the inherent relations between the query and dimension incomplete data objects. During query processing, we estimate lower and upper bounds on the probability that a given data object satisfies the query, and use these bounds to filter out irrelevant data objects efficiently. Furthermore, a probability triangle inequality is proposed to further speed up query processing. Experiments on real data sets show that the proposed similarity query method is effective and efficient on dimension incomplete data.
  • Keywords
    data mining; query processing; data collection; data retrieval; dimension incomplete data; information retrieval; probabilistic similarity query; Asia; Costs; Databases; Filters; Multidimensional systems; Sun; Upper bound
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Data Mining, 2009. ICDM '09. Ninth IEEE International Conference on
  • Conference_Location
    Miami, FL
  • ISSN
    1550-4786
  • Print_ISBN
    978-1-4244-5242-2
  • Electronic_ISBN
    1550-4786
  • Type
    conf
  • DOI
    10.1109/ICDM.2009.72
  • Filename
    5360233