Accuracy versus cost in distributed data mining
| Field | Value |
|---|---|
| Institution | Oregon State University |
| Department | Computer Science |
| Degree | MS |
| Year | 2007 |
| Keywords | data mining; Data mining – Economic aspects |
| Record ID | 1799761 |
| Full text PDF | http://hdl.handle.net/1957/6226 |
A basic tradeoff in designing a distributed data-mining framework is between the cost of communication and computation resources and the accuracy of the mining results. The question is essentially whether it is more efficient to communicate all of the data to a central site for analysis, possibly increasing the accuracy of the results, or to mine the data locally at each remote site and then combine the results, possibly reducing the use of communication and computation resources. To address this tradeoff, this research undertakes the design, analysis, and implementation of an efficient distributed and cumulative learning algorithm for knowledge acquisition from distributed data sources, with performance guarantees that are provable relative to its centralized or batch counterparts. This thesis also develops a methodical mathematical framework to describe this type of tradeoff, reduces the problem to a constrained optimization problem, and demonstrates techniques to balance cost and accuracy levels.
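The tradeoff described in the abstract can be illustrated with a toy sketch. This is not the thesis's algorithm: the task (estimating a median), the cost model (one unit per value transmitted), and all function names here are assumptions chosen only to make the idea concrete. The centralized strategy ships every value to one site; the local strategy ships one summary per site and combines them, which is cheaper but only approximate. The last helper treats the choice as a small constrained optimization: pick the cheapest strategy whose error stays within a tolerance.

```python
import random
import statistics


def make_sites(num_sites=5, per_site=200, seed=0):
    """Hypothetical data: each site draws from a differently shifted Gaussian."""
    rng = random.Random(seed)
    return [[rng.gauss(site, 1.0) for _ in range(per_site)]
            for site in range(num_sites)]


def centralized(sites):
    """Ship every value to a central site; cost = number of values sent."""
    pooled = [x for site in sites for x in site]
    return statistics.median(pooled), len(pooled)


def local_then_combine(sites):
    """Mine locally, ship one summary per site; cost = number of sites."""
    local_medians = [statistics.median(site) for site in sites]
    return statistics.median(local_medians), len(local_medians)


def compare(sites):
    """Return cost and error of each strategy, using the pooled median as truth."""
    exact, cost_central = centralized(sites)
    approx, cost_local = local_then_combine(sites)
    return [
        {"name": "centralized", "cost": cost_central, "error": 0.0},
        {"name": "local", "cost": cost_local, "error": abs(approx - exact)},
    ]


def cheapest_within_tolerance(candidates, tolerance):
    """Minimize cost subject to error <= tolerance; None if nothing qualifies."""
    feasible = [c for c in candidates if c["error"] <= tolerance]
    return min(feasible, key=lambda c: c["cost"]) if feasible else None


if __name__ == "__main__":
    candidates = compare(make_sites())
    for c in candidates:
        print(c)
    print(cheapest_within_tolerance(candidates, tolerance=0.5))
```

Tightening the tolerance pushes the choice toward the centralized strategy, while loosening it favors the cheaper local one; a realistic cost model would also account for computation at each site, which this sketch ignores.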