DocumentCode :
1992443
Title :
ADD: Arabic duplicate detector - a duplicate detection data cleansing tool
Author :
Haraty, R.A. ; Varjabedian, R.
Author_Institution :
Comput. Sci. Program, Lebanese American Univ., Beirut, Lebanon
fYear :
2003
fDate :
14-18 July 2003
Firstpage :
137
Abstract :
Summary form only given. Data mining, a term introduced in the 1990s, is the process of extracting useful information from huge amounts of data. It is sometimes referred to as "data discovery" or "knowledge discovery in databases". What counts as useful information depends on the goal the data mining was undertaken for in the first place. Useful information can be used to increase revenue and cut costs, and it can also serve research purposes. Advances in hardware and software in the late 1990s made data centralization possible. The process of centralizing data is also called "data warehousing", and the centralized data is called a "data warehouse". With data centralization came a very important issue: the quality of the centralized data, since centralization involves joining multiple data sources. The input to the data mining process must be of high quality for the results of that process to be accurate and reliable. Before data can be mined to extract useful information, it goes through a process called data cleansing. This process is as old as the word "data" itself; however, the term regained significance in the 1990s. Data cleansing involves several steps and processes that use one or more algorithms. We address one important step, duplicate data detection. We present a duplicate detection method called the efficient k-way sorting method. We also present a tool called Arabic duplicate detector (ADD), which is based on our method and tailored for Arabic data.
Keywords :
data mining; data warehouses; knowledge based systems; natural languages; sorting; ADD; Arabic data; Arabic duplicate detector; data centralization; data cleansing tool; data discovery; data mining; data quality; data warehousing; duplicate data detection; efficient k-way sorting method; knowledge-based systems; knowledge discovery; multiple data source; Costs; Data mining; Detectors; Hardware; Large Hadron Collider; Sorting
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Computer Systems and Applications, 2003. Book of Abstracts. ACS/IEEE International Conference on
Conference_Location :
Tunis, Tunisia
Print_ISBN :
0-7803-7983-7
Type :
conf
DOI :
10.1109/AICCSA.2003.1227569
Filename :
1227569