DocumentCode :
169990
Title :
A Multi-level Funneling Approach to Data Provenance Reconstruction
Author :
Aierken, Ailifan ; Davis, Delmar B. ; Zhang, Qi ; Gupta, Kunal ; Wong, Alexander ; Asuncion, Hazeline U.
Author_Institution :
Sch. of Sci., Technol., Eng. & Math., Univ. of Washington Bothell, Bothell, WA, USA
Volume :
2
fYear :
2014
fDate :
20-24 Oct. 2014
Firstpage :
71
Lastpage :
74
Abstract :
When data are retrieved from a file storage system or the Internet, is there information about their provenance (i.e., their origin or history)? It is possible that data could have been copied from another source and then transformed. Often, provenance is not readily available for data sets created in the past. Solving such a problem is the motivation behind the 2014 Provenance Reconstruction Challenge. This challenge is aimed at recovering lost provenance for two data sets: one data set (WikiNews articles) in which a list of possible sources has been provided, and another data set (files from GitHub repositories) in which the file sources are not provided. To address this challenge, we present a multi-level funneling approach to provenance reconstruction, a technique that incorporates text processing techniques from different disciplines to approximate the provenance of a given data set. We built three prototypes using this technique and evaluated them using precision and recall metrics. Our preliminary results indicate that our technique is capable of reconstructing some of the lost provenance.
Keywords :
Web sites; data analysis; meta data; text analysis; 2014 Provenance Reconstruction Challenge; GitHub repository; Internet; WikiNews articles; data history; data origin; data provenance reconstruction; data retrieval; file sources; file storage system; lost provenance recovery; metadata information; multilevel funneling approach; precision metrics; provenance information; recall metrics; semantic content; text processing technique; Image reconstruction; Information retrieval; Measurement; Natural language processing; Prototypes; Semantics; Vectors; longest common subsequence; semantic analysis; similarity metrics; topic modeling; vector space model
fLanguage :
English
Publisher :
IEEE
Conference_Title :
e-Science (e-Science), 2014 IEEE 10th International Conference on
Conference_Location :
São Paulo
Print_ISBN :
978-1-4799-4288-6
Type :
conf
DOI :
10.1109/eScience.2014.54
Filename :
6972100