Accuracy versus cost in distributed data mining

by Stephanie Deutschman




Institution: Oregon State University
Department: Computer Science
Degree: MS
Year: 2007
Keywords: data mining; Data mining – Economic aspects
Record ID: 1799761
Full text PDF: http://hdl.handle.net/1957/6226


Abstract

A basic tradeoff in designing a distributed data-mining framework is the compromise between the cost of communication and computation resources and the accuracy of the mining results. The decision is essentially whether it is more efficient to communicate all of the data to a central site for analysis, possibly increasing the accuracy of the results, or to mine the data locally at each remote site and then combine the results, possibly reducing the use of communication and computation resources. To address this tradeoff, this research undertakes the design, analysis, and implementation of an efficient distributed and cumulative learning algorithm for knowledge acquisition from distributed data sources, with provable performance guarantees relative to its centralized or batch counterparts. This thesis also develops a methodical mathematical framework to describe this type of tradeoff, reduces the problem to a constrained optimization problem, and demonstrates techniques for balancing cost and accuracy levels.
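
As a rough illustration of how a cost-versus-accuracy tradeoff of this kind can be posed as a constrained optimization problem, the sketch below chooses, for each remote site, whether to ship its raw data to the central site or to mine it locally, maximizing an estimated accuracy gain subject to a cost budget. This is not the thesis's algorithm or framework: the function name `best_plan`, the per-site tuple layout, the additive cost and gain model, and all numbers are assumptions made purely for illustration, and the exhaustive search is only practical for a handful of sites.

```python
from itertools import product


def best_plan(sites, budget):
    """Pick ship-or-mine-locally per site to maximize gain within a budget.

    sites  : list of (ship_cost, local_cost, ship_gain) tuples. All three
             quantities are hypothetical: the cost of centralizing a site's
             data, the cost of mining it locally, and the estimated accuracy
             gained by centralizing it.
    budget : maximum total cost allowed.

    Returns (total_gain, total_cost, plan), where plan[i] is True if site i
    ships its data, or None if no plan fits within the budget.
    """
    best = None
    # Enumerate every ship/local assignment (2**n plans for n sites).
    for plan in product([False, True], repeat=len(sites)):
        cost = sum(s[0] if ship else s[1] for s, ship in zip(sites, plan))
        if cost > budget:
            continue  # violates the cost constraint
        gain = sum(s[2] for s, ship in zip(sites, plan) if ship)
        if best is None or gain > best[0]:
            best = (gain, cost, plan)
    return best


if __name__ == "__main__":
    # Illustrative numbers only; they do not come from the thesis.
    example_sites = [(10, 2, 0.05), (8, 1, 0.03), (20, 3, 0.04)]
    print(best_plan(example_sites, budget=15))
```

Raising or lowering `budget` in this toy model shows the tradeoff directly: a larger budget lets more sites centralize their data for a higher estimated gain, while a tight budget forces more local mining.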