
Automatic and efficient data virtualization system for scientific datasets

by LI WENG




Institution: The Ohio State University
Department: Computer and Information Science
Degree: PhD
Year: 2006
Keywords: Computer Science
Record ID: 1779022
Full text PDF: http://rave.ohiolink.edu/etdc/view?acc_num=osu1154717945


Abstract

Efficient access to, and high-performance processing of, scientific datasets is challenging for several reasons. First, scientific datasets are typically stored as binary or character flat files. Second, data servers need to efficiently serve an increasing number of clients and types of queries as more data come online. To address these issues, we concentrated on the following areas: 1) Realizing data virtualization through automatically generated data services over scientific datasets. 2) Supporting data analysis processing by means of SQL-3 queries and aggregations for the data virtualization system. 3) Designing new techniques for efficient execution of data analysis queries using space-partitioned partial replicas. 4) Generalizing the functionality of the replica selection module through two significant extensions. 5) Exploring the performance optimization potential of multiple queries over massive datasets. To address the first challenge, we developed a Data Virtualization System built around a meta-data descriptor and compiler techniques. We designed a meta-data description language that is used for specifying low-level characteristics of datasets. A scientist can explore a subset of interest and apply complex processing to it using declarative SQL-3 queries and aggregations. Compiler algorithms that use the meta-data descriptor and analyze aggregations were developed to automatically generate efficient data subsetting and data aggregation services. To address the second challenge, we investigated one optimization technique, partial replication. We proposed and implemented a greedy algorithm based on a cost metric to choose the best combination of partial replicas. Moreover, to generalize the work to a more realistic environment, we extended it to range and aggregate queries over both space-partitioned and attribute-partitioned partial replicas, which may be stored unevenly or uniformly across distributed storage units. Using a new cost metric, we devised a composite replica selection algorithm, comprising a dynamic programming strategy and greedy strategies, to solve this problem. Finally, we explored the optimization potential of executing multiple queries over massive datasets. These techniques are implemented in a Replica Selection Module that is tightly coupled with the overall architecture of our Automatic Data Virtualization System.
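
The abstract does not state the cost metric or the details of the greedy replica selection, so the following is only a minimal illustrative sketch of the general idea, not the thesis's algorithm. It assumes a one-dimensional range query over integer cells, a hypothetical `PartialReplica` type, and an invented metric (read cost of the overlapping part of a replica divided by the number of query cells it newly serves). Any cells that no replica serves more cheaply are assumed to be read from the original dataset at `base_cost_per_cell`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PartialReplica:
    """Hypothetical partial replica covering the 1-D cell range [lo, hi)."""
    name: str
    lo: int
    hi: int
    cost_per_cell: float  # assumed cost of reading one cell from this replica


def select_replicas(query_lo, query_hi, replicas, base_cost_per_cell):
    """Greedily pick replicas to answer the range query [query_lo, query_hi).

    Returns (names of chosen replicas, number of cells left for the
    original dataset, estimated total cost).
    """
    uncovered = set(range(query_lo, query_hi))
    chosen, total_cost = [], 0.0

    while uncovered:
        best, best_ratio = None, base_cost_per_cell
        for r in replicas:
            # Cells of this replica that fall inside the query range.
            overlap = set(range(max(r.lo, query_lo), min(r.hi, query_hi)))
            gain = len(overlap & uncovered)  # query cells not yet served
            if gain == 0:
                continue
            # Assumed metric: cost of reading the overlap per newly served cell.
            ratio = r.cost_per_cell * len(overlap) / gain
            if ratio < best_ratio:
                best, best_ratio = r, ratio
        if best is None:
            break  # no replica is cheaper than reading the original dataset
        chosen.append(best.name)
        overlap_size = min(best.hi, query_hi) - max(best.lo, query_lo)
        total_cost += best.cost_per_cell * overlap_size
        uncovered -= set(range(best.lo, best.hi))

    # Whatever remains is served from the original (unreplicated) dataset.
    total_cost += base_cost_per_cell * len(uncovered)
    return chosen, len(uncovered), total_cost


# Usage with made-up replicas and costs.
replicas = [
    PartialReplica("A", 0, 60, 0.25),
    PartialReplica("B", 50, 100, 0.5),
    PartialReplica("C", 0, 100, 0.9),
]
print(select_replicas(0, 100, replicas, base_cost_per_cell=1.0))
```

A greedy choice of this kind is fast but not guaranteed to find the cheapest combination, which is one reason a selection module might combine it with other strategies, as the abstract describes for the composite algorithm.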