Author :
Atluri, Gowtham ; Dey, Sanjoy ; Fang, Gang ; Landman, Sean ; Paunic, Vanja ; Wang, Wen ; Steinbach, Michael ; Kumar, Vipin
Abstract :
There has been a dramatic increase in the quantity, quality, and types of advanced biomedical information available to individuals and their medical providers. These types of data include, but are not limited to, cell process information provided by DNA microarrays and RNA-seq, genetic information in the form of Single Nucleotide Polymorphisms (SNPs), metabolomics data in terms of proteins and other metabolites, and structural and functional brain data from magnetic resonance imaging (MRI). Together with the increasing availability of clinical data from electronic medical records, this abundance of data has created the very real possibility of personalized medicine, i.e., using detailed biomedical, clinical, and environmental information about a person for a customized and more effective approach to patient care [11], [16], [3]. Achieving this goal requires identifying those features of the data that can distinguish not only between healthy or low-risk subjects (controls) and diseased or high-risk subjects (cases), but also among different subgroups of cases. These features are typically predictive patterns (biomarkers) that are associated with the disease or other phenotype of interest. Simple examples are a SNP that indicates a predisposition for a particular disorder or a protein or small molecule whose presence signals cancer. These patterns can be directly useful in diagnosis, treatment, or prevention but, equally important, they can also provide insights into the underlying nature of the disease or related biomedical processes. Unfortunately, the lack of readily available, easy-to-use, and effective tools and techniques for finding trustworthy and useful markers is limiting progress in medical research and slowing the advent of personalized medicine [12], [4]. Several well-known challenges are responsible for this lack of progress.
First, the large number of individual factors (e.g., hundreds of thousands or millions of SNPs) often makes it difficult to find statistically significant single markers without large numbers of samples. In addition, the complexity of the diseases being considered makes it unlikely that meaningful predictive patterns can be based on single factors. Thus, techniques for extracting meaningful associations must be able to discover combinations of factors that show a significant association with a disease phenotype even when single factors have little or no association. However, searching for such high-order interactions leads to increased computational complexity, since the number of possible patterns increases exponentially with pattern length. Perhaps an even more serious challenge is that of multiple hypothesis testing, which results from the enormous number of potential patterns (hypotheses) and the resulting increased probability of mistaking spurious patterns for real ones. Yet another complication is the heterogeneous nature of many diseases, i.e., patients with a particular disease may form different subgroups, and predictive patterns appropriate for one subgroup may not apply to another. To more fully capture the broad range of factors responsible for complex disorders, it is necessary to undertake the difficult task of integrating diverse types of data from the same set of subjects or from accumulated biomedical knowledge, e.g., functional annotations of genes or proteins. Given the inability of current techniques to handle these challenges (computational complexity, statistical significance, heterogeneity, and data integration), it is no surprise that even when statistically significant patterns are found in one study, they are rarely reproduced in follow-up studies by different groups [13], [14], [5].
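To give a rough sense of the scale behind the computational complexity and multiple hypothesis testing challenges, the sketch below counts candidate patterns of a given length and shows how a per-test significance threshold shrinks as that count grows. It is only an illustration under stated assumptions: the figure of 500,000 SNPs is a hypothetical value in the range mentioned above, the function names `count_patterns` and `bonferroni_threshold` are invented for this example, and the Bonferroni correction is used as one familiar way to account for multiple testing, not as the method used in the work described here.

```python
from math import comb


def count_patterns(n_features: int, max_len: int) -> int:
    """Count distinct feature combinations of size 1..max_len (hypothetical helper)."""
    return sum(comb(n_features, k) for k in range(1, max_len + 1))


def bonferroni_threshold(alpha: float, n_tests: int) -> float:
    """Per-test significance level under a Bonferroni correction (illustrative choice)."""
    return alpha / n_tests


# Assumption: 500,000 SNPs, a hypothetical value chosen for illustration only.
n_snps = 500_000

for length in (1, 2, 3):
    n_candidates = comb(n_snps, length)  # patterns of exactly this length
    threshold = bonferroni_threshold(0.05, n_candidates)
    print(f"length {length}: {n_candidates:.3e} candidates, "
          f"per-test threshold {threshold:.3e}")
```

Even at pattern length 2, the number of candidate combinations exceeds one hundred billion, which is why exhaustive search and naive significance testing both become impractical as pattern length grows.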
This talk will present our group's recent research on pattern mining based approaches for addressing these challenges [8], [7], [9], [6], [10], [15], [2], [20], [17], [18], [1], [19].