AbstractsMathematics

Exact Inference Algorithms and Their Optimization in Bayesian Clustering

by Jukka Kohonen




Institution: University of Helsinki
Department: Department of Mathematics and Statistics
Year: 2015
Keywords: tilastotiede
Record ID: 1140108
Full text PDF: http://hdl.handle.net/10138/153447


Abstract

Clustering is a central task in computational statistics. Its aim is to divide observed data into groups of items, based on the similarity of their features. Among various approaches to clustering, Bayesian model-based clustering has recently gained popularity. Many existing works are based on stochastic sampling methods. This work is concerned with exact, exponential-time algorithms for the Bayesian model-based clustering task. In particular, we consider the exact computation of two summary statistics: the number of clusters, and pairwise incidence of items in the same cluster. We present an implemented algorithm for computing these statistics substantially faster than would be achieved by direct enumeration of the possible partitions. The method is practically applicable to data sets of up to approximately 25 items. We apply a variant of the exact inference method into graphical models where a given variable may have up to four parent variables. The parent variables can then have up to 16 value combinations, and the task is to cluster them and find combinations that lead to similar conditional probability tables. Further contributions of this work are related to number theory. We show that a novel combination of addition chains and additive bases provides the optimal arrangement of multiplications, when the task is to use repeated multiplication starting from a given number or entity, but only a certain kind of function of the successive powers is required. This arrangement speeds up the computation of the posterior distribution for the number of clusters. The same arrangement method can be applied to other multiplicative tasks, for example, in matrix multiplication. We also present new algorithmic results related to finding extremal additive bases. Before this work, the extremal additive bases were known up to length 23. We have computed them up to length 24 in the unrestricted case, and up to length 41 in the restricted case. Klusterointi on keskeinen laskennallisen tilastotieteen menetelmä. Klusteroinnissa samankaltaisia havaintoja ryhmitellään yhteen. Eri klusterointimenetelmiä on lukuisia; viime aikoina on yleistynyt bayesiläinen mallipohjainen klusterointi. Siihen on yleensä sovellettu stokastisia menetelmiä: kokeillaan useita erilaisia tapoja klusteroida samat havainnot, osin satunnaisesti, ja näin saadaan laskennallinen arvio oikeasta klusterointiratkaisusta. Tässä työssä sen sijaan tutkitaan bayesiläistä mallipohjaista klusterointia eksakteilla algoritmeilla, joiden aikavaativuus on eksponentiaalinen. Niillä voidaan laskea tarkka todennäköisyysjakauma kahdelle oikeaa klusterointiratkaisua kuvaavalle tunnusluvulle: klusterien lukumäärälle sekä sille, mitkä havaintoparit kuuluvat samaan klusteriin. Työssä toteutettu algoritmi laskee nämä tunnusluvut huomattavasti nopeammin kuin jos yksinkertaisesti käytäisiin läpi kaikki mahdolliset klusterointiratkaisut. Käytännössä menetelmää voi käyttää enintään noin 25 havainnon klusterointiin. Algoritmin muunnelmaa sovelletaan graafisiin…