Statistical methods for motif hit enrichment in DNA sequences

by Wolfgang Kopp

Institution: Freie Universität Berlin
Year: 2017
Posted: 02/01/2018
Record ID: 2153586
Full text PDF: http://edocs.fu-berlin.de/diss/receive/FUDISS_thesis_000000104837


In this thesis, we discuss methods for analyzing the non-coding sequence of the genome (e.g., promoters) with respect to the identification and enrichment of transcription factor binding sites (TFBSs), as they are related to gene regulation. The identification of putative TFBSs is based on the log-likelihood ratio between a TF motif, which describes the binding affinity of a TF towards the DNA, and a background model, which is implemented by an order-d Markov model with d ≥ 0, in conjunction with a pre-defined log-likelihood ratio threshold. Chapter 2 reviews algorithms for computing the false positive probability of calling motif hits for a given threshold. As putative TFBSs can self-overlap, which affects the enrichment test of the number of TFBSs, we discuss the quantification of overlapping TFBS predictions in Chapter 3. In Chapter 4, we discuss a compound Poisson model for the distribution of the number of TFBSs in both strands of the DNA sequence, which represents an extension of Pape et al. [36]. The main advances of our model are the use of newly derived principal overlapping hit probabilities, which are motivated by the discussion of principal periods in Reinert et al. [41], and support for higher-order Markov models for the background. In Chapter 5, we discuss a novel Markov model that is used to determine the probability of a TFBS occurrence that does not overlap a previous TFBS occurrence, termed the clump start probability, since such an occurrence marks the beginning of a clump. The resulting clump start probability then serves as an important building block for Chapter 6. Finally, in Chapter 6, we present a novel combinatorial model for the distribution of the number of motif hits. To that end, we efficiently sum up the probabilities of all realizations of placing x TFBSs in a finite-length sequence of length N. We systematically compared the accuracy of the combinatorial model, the compound Poisson model, and the binomial model.
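
As a rough illustration of the ingredients named above (log-likelihood ratio scoring against a background model, the false positive probability for a threshold, and the binomial model used as a baseline), the following Python sketch may help. It is not the thesis's implementation and not the API of the R package: the motif `MOTIF`, the uniform order-0 background `BACKGROUND`, the threshold, and all function names are hypothetical, and the false positive probability is obtained by brute-force enumeration, which is only feasible for very short motifs.

```python
import itertools
import math

ALPHABET = "ACGT"
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}

# Hypothetical 4-position motif (per-position nucleotide probabilities).
MOTIF = [
    {"A": 0.85, "C": 0.05, "G": 0.05, "T": 0.05},
    {"A": 0.05, "C": 0.85, "G": 0.05, "T": 0.05},
    {"A": 0.05, "C": 0.05, "G": 0.85, "T": 0.05},
    {"A": 0.05, "C": 0.05, "G": 0.05, "T": 0.85},
]
# Assumption: order-0 (d = 0) uniform background instead of a general order-d model.
BACKGROUND = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}


def llr(word):
    """Log-likelihood ratio of a word under the motif versus the background."""
    return sum(math.log(MOTIF[i][b] / BACKGROUND[b]) for i, b in enumerate(word))


def reverse_complement(seq):
    return "".join(COMPLEMENT[b] for b in reversed(seq))


def count_hits(seq, threshold):
    """Count positions on both strands whose score reaches the threshold."""
    m = len(MOTIF)
    hits = 0
    for strand in (seq, reverse_complement(seq)):
        for i in range(len(strand) - m + 1):
            if llr(strand[i:i + m]) >= threshold:
                hits += 1
    return hits


def false_positive_probability(threshold):
    """Probability that a random background word scores at or above the threshold.

    Brute force over all 4^m words; only a stand-in for the efficient
    algorithms that the abstract says are reviewed in Chapter 2.
    """
    alpha = 0.0
    for word in itertools.product(ALPHABET, repeat=len(MOTIF)):
        if llr(word) >= threshold:
            alpha += math.prod(BACKGROUND[b] for b in word)
    return alpha


def binomial_tail(n, p, x):
    """P(X >= x) for X ~ Binomial(n, p); treats all positions as independent."""
    return sum(math.comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(x, n + 1))


if __name__ == "__main__":
    seq = "TTACGTTTTTACGTTT"
    threshold = 4.0
    alpha = false_positive_probability(threshold)
    hits = count_hits(seq, threshold)
    trials = 2 * (len(seq) - len(MOTIF) + 1)  # both strands
    print(alpha, hits, binomial_tail(trials, alpha, hits))
```

Because the hypothetical motif is palindromic, every forward-strand hit in this example also counts as a reverse-strand hit. The binomial tail ignores such dependencies between hits, which is the kind of effect the overlap-aware models described in the abstract are meant to account for.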
An implementation of the algorithms discussed in this thesis is provided as an R package, available at https://github.com/wkopp/mdist.

In this dissertation, we address the statistical analysis of non-coding segments of the genome. In particular, we consider methods for the identification and enrichment analysis of transcription factor binding sites (TFBSs) in DNA segments (e.g., in promoters), since the binding of transcription factors has a regulatory effect on the expression of neighboring genes. The identification of TFBSs is based on the log-likelihood ratio between a known transcription factor motif, which describes the DNA binding affinity of the transcription factor, and a background model, e.g., a Markov model of order d, using a fixed threshold. Chapter 2 describes the computation of the false positive probability for the chosen threshold. Since the identification of TFBSs