A Caltech Library Service

Statistical Methods for Gene Differential Expression Analysis of RNA-Sequencing


Yi, Lynn Donglin (2019) Statistical Methods for Gene Differential Expression Analysis of RNA-Sequencing. Dissertation (Ph.D.), California Institute of Technology. doi:10.7907/0YE6-2217.


RNA-Sequencing ("RNA-Seq") is performed to measure gene expression, often to ask which genes are differentially expressed across biological conditions. Statistical methods have been used to model RNA-Seq quantifications in order to determine differential expression, and have traditionally been divided into gene-level methods and transcript-level methods. There has been little attempt to bridge this statistical divide, although transcript expression and gene expression are biologically inextricably linked. In this thesis, we provide a case study of a comparative differential expression analysis, demonstrating that many differential expression events happen at the isoform level, and that an analysis using only summarized gene quantifications would fail to capture these events. Furthermore, we develop statistical methods that unify transcript-level and gene-level analysis. In bulk RNA-Seq, by using p-value aggregation methods, we are able to translate transcript-level results into gene-level results under a unified framework. For single-cell RNA-Seq, we propose using multiple logistic regression, leveraging the high dimensionality of the data to determine whether the transcript quantifications pertaining to a gene constitute a linear discriminant for cell type. This method combines differential transcript expression analysis and differential gene expression analysis into a unified framework which we call “gene differential expression.” Lastly, we demonstrate that our methods can be used on transcript compatibility counts instead of transcript quantifications in order to bypass ambiguous read assignment and improve accuracy. We show that transcript compatibility counts obtained via transcriptome pseudoalignment are comparable in quantification accuracy to quantifications from genome alignment methods.
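
The p-value aggregation step mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not name the aggregation rule, so Fisher's method is used here purely as an assumption, and it further assumes the transcript-level p-values of a gene are independent. The function names `fisher_combine` and `gene_level_p_values`, and the transcript and gene identifiers in the usage note, are hypothetical and not taken from the thesis.

```python
import math
from collections import defaultdict


def fisher_combine(p_values):
    """Combine independent p-values with Fisher's method (an assumed choice).

    The statistic X = -2 * sum(log p_i) follows a chi-square distribution
    with 2k degrees of freedom under the null. For even degrees of freedom
    the survival function has the closed form
        exp(-x/2) * sum_{i=0}^{k-1} (x/2)^i / i!
    so no external statistics library is needed.
    """
    if not p_values:
        raise ValueError("need at least one p-value")
    # Clamp to (0, 1] so that log() is always defined.
    clipped = [min(max(p, 1e-300), 1.0) for p in p_values]
    half_stat = -sum(math.log(p) for p in clipped)  # equals X / 2
    # Accumulate the Poisson-like terms exp(-h) * h^i / i! one at a time.
    # This is adequate for the handful of transcripts a gene typically has.
    term = math.exp(-half_stat)
    total = term
    for i in range(1, len(clipped)):
        term *= half_stat / i
        total += term
    return min(1.0, total)


def gene_level_p_values(transcript_p, transcript_to_gene):
    """Group transcript-level p-values by gene and aggregate each group."""
    grouped = defaultdict(list)
    for transcript, p in transcript_p.items():
        grouped[transcript_to_gene[transcript]].append(p)
    return {gene: fisher_combine(ps) for gene, ps in grouped.items()}
```

For example, `gene_level_p_values({"t1": 0.05, "t2": 0.05, "t3": 0.5}, {"t1": "gA", "t2": "gA", "t3": "gB"})` returns one aggregated p-value per gene. A gene with a single transcript keeps that transcript's p-value unchanged.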

Item Type:Thesis (Dissertation (Ph.D.))
Subject Keywords:Bioinformatics, computational biology, biostatistics, genomics, gene expression, sequencing, RNA sequencing, differential expression, transcriptome, genome, transcript, pseudoalignment, equivalence class, transcript compatibility counts, kallisto
Degree Grantor:California Institute of Technology
Division:Biology and Biological Engineering
Major Option:Systems Biology
Thesis Availability:Public (worldwide access)
Research Advisor(s):
  • Pachter, Lior S.
Thesis Committee:
  • Chan, David C. (chair)
  • Thomson, Matthew
  • Pachter, Lior S.
  • Chandrasekaran, Venkat
Defense Date:15 April 2019
Funding Agency:Walter and Sylvia Treadway Fund
Grant Number:UNSPECIFIED
Record Number:CaltechTHESIS:10102018-143313907
Persistent URL:
Related URLs (URL, URL Type, Description):
  • adapted for Chapter II
  • adapted for Chapter III
  • adapted for Chapter IV
  • DOI, Manuscript adapted for Chapter V
ORCID:Yi, Lynn Donglin (0000-0003-4575-0158)
Default Usage Policy:No commercial reproduction, distribution, display or performance rights in this work are provided.
ID Code:11226
Deposited By: Lynn Yi
Deposited On:23 May 2019 20:12
Last Modified:04 Oct 2019 00:23

Thesis Files

PDF (Thesis) - Final Version
See Usage Policy.

PDF (Supplementary Material for Chapter III) - Supplemental Material
See Usage Policy.

PDF (Supplementary Figures for Chapter IV) - Supplemental Material
See Usage Policy.

