Higher-level Analysis of RNA-Seq Experiment: Multiple Data Sets and Multiple Genes

by Bin Zhuo

Institution: Oregon State University
Year: 2016
Keywords: RNA-Seq; Gene expression – Statistical methods
Posted: 02/05/2017
Record ID: 2098397
Full text PDF: http://hdl.handle.net/1957/59543


Differential expression (DE) analysis is a key task in gene expression studies because it uncovers the association between the expression levels of a gene and the covariates of interest. This dissertation pertains to two particular aspects of DE analysis: identifying stably expressed genes for count normalization, and accounting for correlation between DE test statistics in gene-set tests. RNA-Sequencing (RNA-Seq) has become the tool of choice for measuring gene expression over the past few years, and data generated from RNA-Seq experiments are the focus of this thesis. Identifying stably expressed genes is useful for count normalization and DE analysis. We examined RNA-Seq data on 211 biological samples from 24 different experiments conducted by different labs, and identified genes that are stably expressed across samples, treatment conditions, and experiments. We fit a Poisson log-linear mixed-effect model to the count data and decomposed the total variance into between-sample, between-treatment, and between-experiment variance components. The variance component analysis that we explore here is a first step towards understanding the sources and nature of RNA-Seq count variation. The stability ranking of genes, when quantified by a numerical stability measure, depends on several factors: the background sample set and the reference gene set used for count normalization, the technology used to measure gene expression, and the specific stability measure. Since DE is measured by relative frequencies, we argue that DE is a relative concept. We advocate using an explicit reference gene set for count normalization to improve the interpretability of DE results, and recommend using a common reference gene set when analyzing multiple RNA-Seq experiments to avoid potentially inconsistent conclusions. We investigate the relationship between the correlation among test statistics and the correlation of the underlying observed data.
For false discovery rate (FDR) control procedures and gene-set tests, pooling DE test statistics is a frequently used idea, and the correlation among test statistics needs to be taken into account. The sample correlation of the observed data is often used to approximate the test-statistic correlation. We show, however, that such an approximation is valid only under limited settings. In particular, we derive a formula for the correlation between test statistics when they take a specific form and, as a special case, present the exact expression for the test-statistic correlation of equal-variance two-sample t-test statistics under a bivariate normal assumption. We conclude that, in the context of the equal-variance two-sample t-test, the test-statistic correlation is weaker than the correlation of the underlying (normally distributed) observed data. The competitive gene-set test is a widely used tool for interpreting high-throughput biological data, such as gene expression and proteomics data. It aims at testing categories of genes for enriched association signals in a list of genes inferred from genome-wide data. Most conventional…

Advisors/Committee Members: Di, Yanming (advisor), Emerson, Sarah (committee member).
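As an illustration of the abstract's point about equal-variance two-sample t-statistics, the following Python sketch estimates by simulation the correlation between the t-statistics of two genes whose observations are bivariate normal with a known correlation. It is a minimal sketch and not the dissertation's derivation or code: the function name `simulate_t_correlation`, the group size, the number of replicates, and the seed are all assumptions chosen for illustration, and both groups are simulated under the null hypothesis of no mean difference.

```python
import numpy as np


def simulate_t_correlation(rho, n_per_group=5, n_sim=20000, seed=0):
    """Monte Carlo estimate of the correlation between two genes'
    equal-variance two-sample t-statistics (hypothetical helper).

    Observations for the two genes are bivariate normal with unit
    variances and correlation `rho`; both groups have mean zero.
    """
    rng = np.random.default_rng(seed)
    cov = np.array([[1.0, rho], [rho, 1.0]])

    # Arrays of shape (n_sim, n_per_group, 2): replicate, sample, gene.
    x = rng.multivariate_normal([0.0, 0.0], cov, size=(n_sim, n_per_group))
    y = rng.multivariate_normal([0.0, 0.0], cov, size=(n_sim, n_per_group))

    # Difference in group means for each gene, shape (n_sim, 2).
    mean_diff = x.mean(axis=1) - y.mean(axis=1)

    # Pooled variance; with equal group sizes it is the average of the
    # two unbiased sample variances.
    pooled_var = (x.var(axis=1, ddof=1) + y.var(axis=1, ddof=1)) / 2.0

    # Equal-variance two-sample t-statistic for each gene.
    t = mean_diff / np.sqrt(pooled_var * 2.0 / n_per_group)

    # Sample correlation between the two genes' t-statistics.
    return np.corrcoef(t[:, 0], t[:, 1])[0, 1]


if __name__ == "__main__":
    rho = 0.8
    print("data correlation:", rho)
    print("estimated t-statistic correlation:", simulate_t_correlation(rho))
```

Running the sketch for several values of `rho` and `n_per_group` lets the estimated t-statistic correlation be compared directly with the data correlation; the estimate carries Monte Carlo error, so small differences should be read with that in mind.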