Intelligent Data Mining on Large-scale Heterogeneous Datasets and its Application in Computational Biology

by Chao Wu




Institution: University of Cincinnati
Department: Engineering and Applied Science: Computer Science and Engineering
Degree: PhD
Year: 2014
Keywords: Computer Science; Machine Learning; Clustering; Data Integration; Biclustering; Bioinformatics; Network-based Methodology
Record ID: 2056181
Full text PDF: http://rave.ohiolink.edu/etdc/view?acc_num=ucin1406880774


Abstract

Machine learning is a branch of artificial intelligence that covers a wide range of learning topics. A variety of supervised and unsupervised machine learning and data mining models have been applied extensively in biomedical informatics studies for knowledge discovery, and the advantages of meta-analysis and data fusion have been discussed in many research domains. In particular, the growing body of data, information, and knowledge covering various dimensions of human development and disease calls for efficient integration and mining efforts that analyze such heterogeneous information simultaneously. In this dissertation, we present our work on extracting hidden knowledge from data on large-scale, complex biological systems, which usually involve heterogeneous entities and the associations between them. First, we propose a biclustering algorithm to identify entities that may manifest cohesiveness within a subspace of conditions. We apply this algorithm to predict combinatorial regulation of transcription factors. We also extend the algorithm to generate 3-clusters in order to capture associations between different classes of entities. Second, we propose network-based approaches to predict drug repositioning candidates. These computational models use heterogeneous genomic and pharmacological information to generate potential drug repositioning candidates. We validate the approach on known indications before applying it to predict new indications for existing drugs. Third, we study several statistical and computational strategies for estimating the overall significance of relationships between different biological entities. We apply these strategies specifically to the problem of microRNA target ranking. We propose a framework that applies a series of data mining methods to prioritize entities in a heterogeneous network context. We also develop a workbench, ToppMiR, based on this framework to infer significant microRNAs and mRNA targets given a biological context.
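The notion of entities being cohesive within a subspace of conditions can be illustrated with a generic coherence score. The sketch below is not the dissertation's algorithm, which the abstract does not specify; it assumes the classic Cheng and Church mean squared residue as a stand-in measure, and the function name `mean_squared_residue` and the toy `expression` matrix are hypothetical. A score near zero means the selected rows follow the same additive pattern across the selected columns.

```python
def mean_squared_residue(matrix, rows, cols):
    """Mean squared residue of the submatrix given by rows x cols.

    Assumption: this Cheng-Church style score stands in for the
    unspecified cohesiveness measure mentioned in the abstract.
    """
    sub = [[matrix[i][j] for j in cols] for i in rows]
    n_rows, n_cols = len(sub), len(sub[0])

    row_means = [sum(r) / n_cols for r in sub]
    col_means = [sum(sub[i][j] for i in range(n_rows)) / n_rows
                 for j in range(n_cols)]
    overall = sum(row_means) / n_rows

    total = 0.0
    for i in range(n_rows):
        for j in range(n_cols):
            # Deviation of each cell from the additive row + column model.
            residue = sub[i][j] - row_means[i] - col_means[j] + overall
            total += residue * residue
    return total / (n_rows * n_cols)


# Hypothetical toy data: rows are entities (e.g. genes), columns are conditions.
expression = [
    [1.0, 2.0, 3.0, 9.0],
    [2.0, 3.0, 4.0, 1.0],
    [4.0, 5.0, 6.0, 7.0],
    [8.0, 1.0, 5.0, 2.0],
]

# The first three rows shift together over the first three columns.
coherent = mean_squared_residue(expression, rows=[0, 1, 2], cols=[0, 1, 2])
# The full matrix mixes unrelated patterns.
whole = mean_squared_residue(expression, rows=[0, 1, 2, 3], cols=[0, 1, 2, 3])
```

Under this assumed score, a biclustering search would favor row and column subsets such as the first one, whose residue is far lower than that of the full matrix.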