CaltechTHESIS
  A Caltech Library Service

Methods for Long Read RNA-Seq Transcriptomics

Citation

Loving Ngo, Rebekah Kiana (2025) Methods for Long Read RNA-Seq Transcriptomics. Dissertation (Ph.D.), California Institute of Technology. doi:10.7907/3nz8-3c83. https://resolver.caltech.edu/CaltechTHESIS:06012025-204136107

Abstract

While short read RNA-seq dominated the field for decades, long read RNA-seq is particularly useful for isoform-level expression analysis, genome annotation, detecting novelly spliced transcripts, identifying exact breakpoints in gene fusions, and discovering chimeric RNAs. Long read RNA-seq has rapidly scaled to the point of producing terabytes of data from a single set of experiments. Technological advances in RNA and DNA sequencing library preparation, in the chemistry used in Oxford Nanopore sequencing, and in basecalling algorithms have reduced long read sequencing error rates to below 1%. Further, the cost of long read sequencing has dropped to about one hundred US dollars per human genome. These two factors have led to the mass production of high-throughput, long read, and single-cell RNA-seq data. Although tools for long read RNA-seq have recently been developed, they have not kept pace with the data in scalability and accuracy the way short read RNA-seq tools have. To address this gap, this thesis leverages long k-mers and pseudoalignment to map and quantify long reads in a novel algorithm implemented in lr-kallisto, which is efficient and achieves higher accuracy in long read mapping and quantification than previous tools. We demonstrate that long read RNA-seq has reached sufficient depth and accuracy to yield accurate quantification of isoform-level expression for differential expression analysis. Furthermore, we explore the feasibility of using long k-mers and pseudoalignment for transcript discovery with dn-kallisto and for gene fusion and immune receptor sequence discovery with fugi, with measured success.
Thus, our tools will enable a more complete, accurate, and scalable analysis of single-cell and bulk RNA-seq than has hitherto been possible, in both quantification and differential expression analysis, as well as unbiased investigation of gene fusions, chimeric RNAs, and immune receptor sequences.

Item Type: Thesis (Dissertation (Ph.D.))
Subject Keywords: RNA-seq; long read; k-mers; t-DBG; fusion; TCR; gene isoform; transcript discovery; single-cell; single-nuclei; distinguishing flanking k-mers
Degree Grantor: California Institute of Technology
Division: Biology and Biological Engineering
Major Option: Biology
Minor Option: Systems Biology
Thesis Availability: Public (worldwide access)
Research Advisor(s):
  • Pachter, Lior S.
Thesis Committee:
  • Thomson, Matthew (chair)
  • Wold, Barbara J.
  • Perona, Pietro
  • Mortazavi, Ali
  • Pachter, Lior S.
Defense Date: 30 May 2025
Non-Caltech Author Email: bekahloving (AT) gmail.com
Funders:
  • Department of Energy, Office of Science, Office of Advanced Scientific Computing Research, Department of Energy Computational Science Graduate Fellowship (grant number DE-SC0020347)
  • National Institutes of Health NHGRI (grant number UM1 HG012077)
Projects: lr-kallisto
Record Number: CaltechTHESIS:06012025-204136107
Persistent URL: https://resolver.caltech.edu/CaltechTHESIS:06012025-204136107
DOI: 10.7907/3nz8-3c83
Related URLs:
  • https://doi.org/10.1101/2024.07.19.604364 (DOI): lead author article adapted for Chapter 2
  • https://doi.org/10.1101/2025.04.21.649844 (DOI): co-authored article: Systematic cell-type resolved transcriptomes of 8 tissues in 8 lab and wild-derived mouse strains captures global and local expression variation
  • https://doi.org/10.1186/s12863-023-01151-4 (DOI): co-authored article: Overestimated prediction using polygenic prediction derived from summary statistics
  • https://drive.google.com/file/d/1MtGfp1Zhecr-tqK1qbMEXWnnPF-R3Ung/view (Other): lead author conference article: A network fusion model pipeline for multi-modal, deep learning for tree crown detection
ORCID:
  • Loving Ngo, Rebekah Kiana: 0000-0001-8725-0376
Default Usage Policy: No commercial reproduction, distribution, display or performance rights in this work are provided.
ID Code: 17344
Collection: CaltechTHESIS
Deposited By: Rebekah Loving
Deposited On: 04 Jun 2025 22:33
Last Modified: 17 Jun 2025 17:09

Thesis Files

PDF - Final Version (51MB). See Usage Policy.