Topics in Data Mining: Pattern Enumeration, XML Key Inference and Big Data Query Optimization

by Jonny Daenen

Institution: Universiteit Hasselt
Year: 2016
Posted: 02/05/2017
Record ID: 2064166
Full text PDF: http://hdl.handle.net/1942/22025


In this work, we identify three challenging subtopics in optimizing Big Data mining workflows. First, we focus on pattern mining and investigate the problem of enumerating string patterns described by a context-free language. We derive guarantees on the delay between generated items when using a naive algorithm. Our results contribute to the foundational aspects of computer science and provide a basis for obtaining similar guarantees in more complex enumeration problems. The second topic remains in the domain of pattern mining: we study the pattern mining problem applied to XML keys. We discuss the complexity of several important decision problems and devise an algorithm for discovering XML keys from a given set of XML data. The presented algorithm builds on previous results from search space exploration and relational key mining and is experimentally validated. For our final topic, we shift our attention to Big Data mining, where query engines answer questions about data that exceeds the capacity of traditional relational database systems. To construct answers within a reasonable amount of time, we focus on parallel evaluation. We present a two-tiered strategy for optimizing query plans for a collection of strictly guarded fragment queries. The nature of these queries allows for a low-cost MapReduce evaluation (in terms of total and net time) that takes at most two rounds per subquery. We provide an implementation in our system, called Gumbo, and extensively compare it to existing systems.

Keywords: Big Data; Data Mining; XML; Pattern Enumeration; Pattern Mining; MapReduce
Advisors/Committee Members: Neven, Frank; Tan, Tony; Van den Bussche, Jan