DocumentCode :
983823
Title :
Hierarchical decision tree induction in distributed genomic databases
Author :
Bar-Or, Amir ; Keren, Daniel ; Schuster, Assaf ; Wolff, Ran
Author_Institution :
HP Labs., Cambridge, MA, USA
Volume :
17
Issue :
8
fYear :
2005
Firstpage :
1138
Lastpage :
1151
Abstract :
Classification based on decision trees is one of the important problems in data mining and has applications in many fields. In recent years, database systems have become highly distributed, and distributed system paradigms, such as federated and peer-to-peer databases, are being adopted. In this paper, we consider the problem of inducing decision trees in a large distributed network of genomic databases. Our work is motivated by the existence of distributed databases in healthcare and in bioinformatics, by the emergence of systems which automatically analyze these databases, and by the expectation that these databases will soon contain large amounts of high-dimensional genomic data. Current decision tree algorithms require high communication bandwidth when executed on such data in large-scale distributed systems. We present an algorithm that sharply reduces the communication overhead by sending just a fraction of the statistical data, a fraction which is nevertheless sufficient to derive the exact same decision tree learned by a sequential learner on all the data in the network. Extensive experiments using standard synthetic SNP data show that the algorithm utilizes the high dependency among attributes, typical of genomic data, to reduce communication overhead by up to 99 percent. Scalability tests show that the algorithm scales well with the size of the data set, the dimensionality of the data, and the size of the distributed system.
Keywords :
biology computing; data mining; decision trees; distributed algorithms; distributed databases; genetics; health care; medical information systems; pattern classification; statistical databases; bioinformatics; distributed genomic databases; distributed system paradigm; federated databases; hierarchical decision tree induction; peer-to-peer databases; statistical data; classification tree analysis; data analysis; database systems; genomics; medical services; peer-to-peer computing; classification
fLanguage :
English
Journal_Title :
Knowledge and Data Engineering, IEEE Transactions on
Publisher :
IEEE
ISSN :
1041-4347
Type :
jour
DOI :
10.1109/TKDE.2005.129
Filename :
1458706