Cell States and Cell Fate: Statistical and Computational Models in (Epi)Genomics

by Daniel Fernandez




Institution: Harvard University
Department:
Degree: PhD
Year: 2015
Keywords: Statistics; Biology, Bioinformatics; Biology, Genetics
Record ID: 2058225
Full text PDF: http://nrs.harvard.edu/urn-3:HUL.InstRepos:14226043


Abstract

This dissertation develops and applies several statistical and computational methods to the analysis of Next Generation Sequencing (NGS) data in order to gain a better understanding of our biology. In the rest of the first chapter we introduce key concepts in molecular biology and recent technological developments that help us better understand this complex science, which, in turn, provide the foundation and motivation for the subsequent chapters. In the second chapter we present the problem of estimating gene/isoform expression at the allelic level, and different models to solve this problem. First, we describe the observed data and the computational workflow to process the data. Next, we propose frequentist and Bayesian models motivated by the central dogma of molecular biology and the data generating process (DGP) for RNA-Seq. We develop expectation-maximization (EM) and Gibbs sampling approaches to estimate gene- and transcript-specific expression from our proposed models. Finally, we present the performance of our models in simulations, and we end with the analysis of experimental RNA-Seq data at the allelic level. In the third chapter we present our paired factorial experimental design to study parentally biased gene/isoform expression in the mouse cerebellum, and dynamic changes of this pattern between young and adult stages of cerebellar development. We present a Bayesian variable selection model to estimate the difference in expression between the paternal and maternal genes, while incorporating relevant factors and their interactions into the model. Next, we apply our model to our experimental data, and we then validate our predictions using pyrosequencing follow-up experiments. We subsequently apply our model to the pyrosequencing data across multiple brain regions. Our method, combined with the validation experiments, allowed us to find novel imprinted genes and to investigate, for the first time, imprinting dynamics across brain regions and across development.
In the fourth chapter we move from controlled experiments in mouse isogenic lines to the highly variant world of human genetics in observational studies. In this chapter we introduce a Bayesian Regression Allelic Imbalance Model, BRAIM, that estimates the imbalance coming from two major sources: cis-regulation and imprinting. We model the cis-effect as an additive effect for the heterozygous group, and we model the parent-of-origin effect with a latent variable that indicates to which parent a given allele belongs. Next, we show the performance of the model under simulation scenarios, and finally we apply the model to several experiments across multiple tissues and multiple individuals. In the fifth chapter we characterize the transcriptional regulation and gene expression of in-vitro Embryonic Stem Cells (ESCs) and two related in-vivo cells: the Inner Cell Mass (ICM) tissue and the embryonic tissue at day 6.5. Our objective is twofold. First, we would like to understand the differences in gene expression between the ESCs and their in-vivo counterpart from where these…