CaltechTHESIS
  A Caltech Library Service

Signal Processing Methods for Genomic Sequence Analysis

Citation

Yoon, Byung-Jun (2007) Signal Processing Methods for Genomic Sequence Analysis. Dissertation (Ph.D.), California Institute of Technology. doi:10.7907/48J3-G286. https://resolver.caltech.edu/CaltechETD:etd-04092007-162353

Abstract

Signal processing is the art of representing, transforming, analyzing, and manipulating signals. It deals with a wide range of signals, from speech and audio signals to images and video signals, and many others. Signal processing techniques have been found very useful in diverse applications. Traditional applications include signal enhancement, denoising, speech recognition, audio and image compression, radar signal processing, and digital communications, to name just a few. In recent years, signal processing techniques have also been applied to the analysis of biological data with considerable success. For example, they have been used for predicting protein-coding genes, analyzing ECG signals and MRI data, enhancing and normalizing DNA microarray images, modeling gene regulatory networks, and so forth.

In this thesis, we consider the application of signal processing methods to the analysis of biological sequences, especially DNA and RNA molecules. We demonstrate how conventional signal processing techniques, such as digital filters and filter banks, can contribute to this end, and also show how we can extend traditional models, such as hidden Markov models (HMMs), to better serve this purpose.

The first part of the thesis focuses on signal processing methods that can be utilized for analyzing RNA sequences. The primary purposes of this part are to develop a statistical model that is suitable for representing RNA sequence profiles and to propose an effective framework that can be used for finding new homologues (i.e., similar RNAs that are biologically related) of known RNAs. Many functional RNAs have secondary structures that are well conserved among different species. The RNA secondary structure gives rise to long-range correlations between distant bases, which cannot be represented using traditional HMMs. In order to overcome this problem, we propose a new statistical model called the context-sensitive HMM (csHMM). The csHMM is an extension of the traditional HMM, where certain states have variable emission and transition probabilities that depend on the context. The context-sensitive property increases the descriptive power of the model significantly, making csHMMs capable of representing long-range correlations between distant symbols. Based on the proposed model, we present efficient algorithms that can be used for finding the optimal state sequence and computing the probability of an observed symbol string. We also present a training algorithm that can be used for optimizing the parameters of a csHMM. We give several examples that illustrate how csHMMs can be used for modeling various RNA secondary structures and recognizing them.
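To make the idea of a long-range correlation concrete, the following is a minimal toy sketch, not the csHMM formulation from the thesis. It assumes a hypothetical generator that remembers the bases it emits for one side of a hairpin stem on a stack and later emits their complements, so the later emissions depend on stored context. The function names `emit_hairpin` and `is_hairpin`, the uniform base choice, and the use of a stack are all illustrative assumptions.

```python
import random

# Watson-Crick complements for RNA. Everything else in this sketch is an
# illustrative assumption, not the model defined in the thesis.
COMPLEMENT = {"A": "U", "U": "A", "C": "G", "G": "C"}


def emit_hairpin(stem_len, loop_len, rng):
    """Emit a toy hairpin: stem, unpaired loop, then the complementary stem."""
    stack = []
    out = []
    # First half of the stem: emit a random base and store it as context.
    for _ in range(stem_len):
        base = rng.choice("ACGU")
        stack.append(base)
        out.append(base)
    # Loop: ordinary emissions that do not depend on any stored context.
    for _ in range(loop_len):
        out.append(rng.choice("ACGU"))
    # Second half of the stem: each emission is fixed by the stored context,
    # read back in last-in, first-out order so the pairs are nested.
    while stack:
        out.append(COMPLEMENT[stack.pop()])
    return "".join(out)


def is_hairpin(seq, stem_len):
    """Check that the outer stem_len bases pair with their mirror positions."""
    return all(COMPLEMENT[seq[i]] == seq[-1 - i] for i in range(stem_len))


if __name__ == "__main__":
    sequence = emit_hairpin(stem_len=4, loop_len=3, rng=random.Random(0))
    print(sequence, is_hairpin(sequence, 4))
```

The first and last bases of the output are always complementary no matter how long the loop is, which is the kind of dependency between distant symbols that a model with only local state transitions cannot capture.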

Based on the concept of csHMM, we introduce profile-csHMMs, which are specifically constructed csHMMs that have linear repetitive structures (i.e., state-transition diagrams). Profile-csHMMs are especially useful for building probabilistic representations of RNA sequence families, including pseudoknots. We also propose a dynamic programming algorithm called the sequential component adjoining (SCA) algorithm that can systematically find the optimal state sequence of an observed symbol string based on a profile-csHMM. In order to demonstrate the effectiveness of profile-csHMMs, we build a structural alignment tool for RNA sequences and show that the profile-csHMM approach can yield highly accurate predictions at a relatively low computational cost. At the end, we describe how the profile-csHMM can be used for finding homologous RNAs, and we propose a practical scheme for making the search significantly faster without affecting the prediction accuracy.

In the second part of the thesis, we focus on the application of digital filters and filter banks in DNA sequence analysis. Firstly, we demonstrate how we can use digital filters for predicting protein-coding genes. Many coding regions in DNA molecules are known to display a period-3 behavior, which can be effectively detected using digital filters. Efficient schemes are proposed that can be used for designing such filters. Experimental results will show that the digital filtering approach can clearly identify the coding regions at a very low computational cost. Secondly, we propose a method based on a bank of IIR lowpass filters that can be used for predicting CpG islands, which are specific regions in DNA molecules that are abundant in the dinucleotide CpG. This filter bank is used to process the sequence of log-likelihood ratios obtained from two Markov chains, where the respective Markov chains model the base transition probabilities inside and outside the CpG islands. The locations of the CpG islands are predicted by analyzing the output signals of the filter bank. It will be shown that the filter bank approach can yield reliable prediction results without sacrificing the resolution of the predicted start/end positions of the CpG islands.
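The two ideas in this paragraph can be illustrated with a rough sketch under stated assumptions. For the period-3 property, the code below swaps in a plain windowed DFT evaluated at frequency 2π/3 on base-indicator sequences, which is a simpler stand-in and not the filter designs proposed in the thesis. For CpG islands, it computes log-likelihood ratios from two first-order Markov chains and smooths them with a bank of one-pole IIR lowpass filters. The transition probabilities in `toy_chain`, the pole values, the window length, and all function names are hypothetical choices made only for illustration.

```python
import cmath
import math

BASES = "ACGT"


def period3_power(seq, window=60):
    """Sliding-window DFT power at 2*pi/3, summed over the four base indicators."""
    powers = []
    for start in range(len(seq) - window + 1):
        chunk = seq[start:start + window]
        total = 0.0
        for base in BASES:
            # DFT coefficient of the 0/1 indicator sequence of this base.
            coeff = sum(cmath.exp(-2j * math.pi * n / 3)
                        for n, symbol in enumerate(chunk) if symbol == base)
            total += abs(coeff) ** 2
        powers.append(total)
    return powers


def toy_chain(p_c_to_g):
    """Hypothetical first-order Markov chain; only the row for C is non-uniform."""
    table = {a: {b: 0.25 for b in BASES} for a in BASES}
    rest = (1.0 - p_c_to_g) / 3
    table["C"] = {b: (p_c_to_g if b == "G" else rest) for b in BASES}
    return table


def log_likelihood_ratios(seq, inside, outside):
    """Per-transition log ratio of the 'inside island' and 'outside island' chains."""
    return [math.log(inside[a][b] / outside[a][b]) for a, b in zip(seq, seq[1:])]


def lowpass_bank(x, poles=(0.9, 0.95, 0.98)):
    """Bank of one-pole IIR lowpass filters: y[n] = a*y[n-1] + (1-a)*x[n]."""
    outputs = []
    for a in poles:
        y, out = 0.0, []
        for value in x:
            y = a * y + (1 - a) * value
            out.append(y)
        outputs.append(out)
    return outputs


if __name__ == "__main__":
    print(period3_power("ATG" * 20)[0], period3_power("ACGT" * 15)[0])
    dna = "AT" * 50 + "CG" * 50 + "AT" * 50
    llr = log_likelihood_ratios(dna, toy_chain(0.4), toy_chain(0.1))
    bank = lowpass_bank(llr)
    print([round(channel[198], 3) for channel in bank])
```

In this toy setup, poles closer to 1 give smoother but slower-reacting outputs, so comparing the channels shows the trade-off between reliable detection and sharp localization of the island boundaries that the paragraph refers to.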

Item Type: Thesis (Dissertation (Ph.D.))
Subject Keywords: bioinformatics; context-sensitive hidden Markov model (csHMM); gene prediction; genomic signal processing; noncoding RNA (ncRNA); profile context-sensitive HMM (profile-csHMM); RNA secondary structure; systems biology
Degree Grantor: California Institute of Technology
Division: Engineering and Applied Science
Major Option: Electrical Engineering
Thesis Availability: Public (worldwide access)
Research Advisor(s):
  • Vaidyanathan, P. P.
Thesis Committee:
  • Vaidyanathan, P. P. (chair)
  • Hassibi, Babak
  • Smolke, Christina D.
  • Ho, Tracey C.
  • Abu-Mostafa, Yaser S.
Defense Date: 22 November 2006
Record Number: CaltechETD:etd-04092007-162353
Persistent URL: https://resolver.caltech.edu/CaltechETD:etd-04092007-162353
DOI: 10.7907/48J3-G286
Default Usage Policy: No commercial reproduction, distribution, display or performance rights in this work are provided.
ID Code: 5182
Collection: CaltechTHESIS
Deposited By: Imported from ETD-db
Deposited On: 02 May 2007
Last Modified: 07 Jun 2023 17:41

Thesis Files

PDF (Yoon_Thesis.pdf) - Final Version
See Usage Policy.

3MB