Title :
Certain Reduction Rules Useful for De-Duplication Algorithm of Indian Demographic Data
Author :
Kaushik, Vandna Dixit ; Bendale, Amit ; Nigam, Abhishek ; Gupta, Puneet
Author_Institution :
Dept. of Comput. Sci. & Eng., Harcourt Butler Technol. Inst., Kanpur, India
Abstract :
This paper proposes rules that help in designing an efficient de-duplication algorithm based on Indian demographic information containing two name strings for each individual, viz. Given Name and Surname. The rules reduce all name strings to generic name strings. Each generic name forms a bin that holds all the name strings reducing to it, together with their Ids. The database of demographic information thus consists of an array of bins, with each bin represented by a singly linked list. At query time, the top n best matches are determined by searching all bins neighbouring the bin of the reduced query name strings. The performance of the rules has been analyzed on a large demographic database of 5,00,000 individuals. The proposed rules are found to reduce the name strings by more than 90%.
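The binning scheme described in the abstract can be illustrated with a minimal sketch. The abstract does not state the actual reduction rules or how "neighbouring" bins are defined, so everything specific here is an assumption made for illustration: `reduce_name` is a hypothetical rule (lowercase, drop vowels after the first letter, collapse repeated letters), neighbouring bins are taken to be those whose generic key lies within a small edit distance of the reduced query, and a Python dict stands in for the array of bins. The names `Node`, `BinIndex`, `edit_distance`, and `radius` are likewise hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    """One entry in a bin's singly linked list."""
    record_id: int
    given_name: str
    surname: str
    next: Optional["Node"] = None


def reduce_name(name: str) -> str:
    """Hypothetical reduction rule (NOT the paper's rules): lowercase,
    drop vowels after the first letter, collapse repeated letters."""
    out = []
    for ch in name.lower():
        if not ch.isalpha():
            continue
        if out and ch in "aeiou":
            continue
        if out and out[-1] == ch:
            continue
        out.append(ch)
    return "".join(out)


def edit_distance(a: str, b: str) -> int:
    """Plain Levenshtein distance, computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


class BinIndex:
    def __init__(self):
        # Generic (given, surname) key -> head of a singly linked list.
        self.bins = {}

    def add(self, record_id: int, given: str, surname: str) -> None:
        key = (reduce_name(given), reduce_name(surname))
        # Prepend the new record to the bin's linked list.
        self.bins[key] = Node(record_id, given, surname, self.bins.get(key))

    def query(self, given: str, surname: str, n: int = 3, radius: int = 1):
        """Return up to n (distance, record_id) pairs, best match first.
        Assumption: a bin is 'neighbouring' if its key is within `radius`
        edits of the reduced query. Scanning every key is a simplification."""
        qg, qs = reduce_name(given), reduce_name(surname)
        scored = []
        for (g, s), head in self.bins.items():
            if edit_distance(qg, g) + edit_distance(qs, s) > radius:
                continue
            node = head
            while node is not None:
                d = (edit_distance(given.lower(), node.given_name.lower())
                     + edit_distance(surname.lower(), node.surname.lower()))
                scored.append((d, node.record_id))
                node = node.next
        return sorted(scored)[:n]


idx = BinIndex()
idx.add(1, "Amit", "Gupta")
idx.add(2, "Ameet", "Gupta")
idx.add(3, "Puneet", "Nigam")
print(idx.query("Amit", "Gupta", n=2))  # [(0, 1), (2, 2)]
```

In this sketch both spellings "Amit" and "Ameet" reduce to the same generic string, so they share one bin, and a query only compares full names against records in the matching and nearby bins rather than against the whole database.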
Keywords :
database management systems; demography; government data processing; query processing; string matching; text analysis; Given Name; Indian demographic data; Indian demographic information; Surname; bin array; deduplication algorithm; generic name strings; large demographic database; query name string reduction; reduction rules; Algorithm design and analysis; Approximation algorithms; Arrays; Computer science; Databases; Electronic mail; Speech; De-duplication; Demographic Information; Distance Matrix; Phonetics;
Conference_Title :
Advanced Computing & Communication Technologies (ACCT), 2014 Fourth International Conference on
Conference_Location :
Rohtak
DOI :
10.1109/ACCT.2014.85