
Methods for Collection and Processing of Gene Expression Data


Murphy, John Frank (2005) Methods for Collection and Processing of Gene Expression Data. Dissertation (Ph.D.), California Institute of Technology. doi:10.7907/8p8e-2147.


Examination of the transcriptional messages encoded in the manifold of mRNA molecules within a cell is a central task of molecular biology and functional genomics. This examination can be broken down into two parts: collection of gene expression data and analysis of those data. Here, a new method for collecting gene expression data and two new methods for analyzing those data are presented.

A new method for quantifying gene expression, denoted Mass-spectrometric Analysis of Gene Expression (MAGE), is developed. MAGE relies on novel conjugates of DNA oligonucleotide 30-mers; each unique sequence is conjugated via a photolabile linker to an N-substituted glycine oligomer (peptoid) of unique mass. Deuterated bromoacetic acid is incorporated into some peptoids, yielding two chemically identical probe conjugates of different molecular weights for each nucleic acid sequence of interest. Mixtures of these probes, along with 3' adjacent biotin-labeled oligonucleotides, are used to interrogate a target mixture of cDNA. Following hybridization, the two adjacent probes are ligated to enhance the specificity of the identification and to enable the use of a biotin-affinity column for removal of confounding peptoid tags. The resulting mixture is exposed to longwave ultraviolet light to release the peptoid tags, which are quantified by MALDI-TOF mass spectrometry with the isotopically labeled peptoids as internal standards. The individual components of MAGE are demonstrated.
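The internal-standard step can be illustrated with a minimal sketch. It assumes a linear detector response and equal ionization efficiency for the unlabeled and deuterated peptoid tags, which the abstract does not state; the function name, parameters, and numbers are hypothetical and are not taken from the thesis.

```python
def quantify_by_internal_standard(analyte_intensity, standard_intensity, standard_amount):
    """Estimate the analyte amount from a peak-intensity ratio.

    Assumption (not from the thesis): the two tags differ only in isotopic
    composition, so their peak intensities scale with amount by the same factor.
    """
    if standard_intensity <= 0:
        raise ValueError("standard_intensity must be positive")
    # Ratio of analyte peak to standard peak, scaled by the known spiked amount.
    return analyte_intensity / standard_intensity * standard_amount


# Hypothetical peak intensities and a hypothetical 2.0 pmol spike of the standard.
estimated_amount = quantify_by_internal_standard(1500.0, 3000.0, 2.0)
```

Under these assumptions the estimate depends only on the ratio of the two peaks, so any intensity variation that affects both peaks equally cancels out.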

A strategy for simplifying and visualizing high-dimensional gene expression data, as well as a strategy for inferring the presence of clusters within those data, is formulated and implemented. To visualize high-dimensional gene expression data, principal components analysis is used, with subsequent mapping of the data onto an orthogonal set of basis functions known as Andrews curves. This analysis method is demonstrated by visualizing breast cancer tumor data and yeast sporulation data. To cluster gene expression data, the expectation-maximization algorithm is employed to optimize the parameters of a mixture model of Lorentzian distributions. The difference between Lorentzian and Gaussian mixture models is first demonstrated with artificial data, and the models are then applied to yeast sporulation data. The results indicate that mixtures of Lorentzian distributions may have significant utility for gene expression analysis.
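The visualization step can be sketched as follows. This is a minimal illustration using the standard Andrews curve definition, f_x(t) = x1/sqrt(2) + x2 sin t + x3 cos t + x4 sin 2t + ..., applied to principal component scores. It is not the thesis implementation: the function names, the number of retained components, and the random stand-in data are all assumptions.

```python
import numpy as np


def pca_scores(data, n_components):
    """Project the rows of `data` onto the leading principal components."""
    centered = data - data.mean(axis=0)
    # SVD of the centered matrix; the rows of vt are the principal axes.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T


def andrews_curve(x, t):
    """Evaluate the Andrews curve of the vector x at the points t."""
    x = np.asarray(x, dtype=float)
    t = np.asarray(t, dtype=float)
    curve = np.full_like(t, x[0] / np.sqrt(2.0))
    for i, coeff in enumerate(x[1:], start=1):
        harmonic = (i + 1) // 2                   # 1, 1, 2, 2, 3, 3, ...
        basis = np.sin if i % 2 == 1 else np.cos  # sin, cos, sin, cos, ...
        curve += coeff * basis(harmonic * t)
    return curve


# Synthetic stand-in for an expression matrix: 20 samples by 50 genes.
rng = np.random.default_rng(0)
expression = rng.normal(size=(20, 50))

scores = pca_scores(expression, n_components=5)
t = np.linspace(-np.pi, np.pi, 200)
curves = np.array([andrews_curve(row, t) for row in scores])  # one curve per sample
```

Because the basis functions are orthogonal on [-pi, pi], the integrated squared difference between two curves equals pi times the squared Euclidean distance between the underlying score vectors, so samples that are close in principal component space produce curves that stay close together when plotted.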

The tools demonstrated here offer unique advantages when compared to the current suite of experimental and analytical tools employed by investigators of functional genomics.

Item Type: Thesis (Dissertation (Ph.D.))
Subject Keywords: Andrews curves; bioconjugations; clustering; gene expression; Lorentzian distributions; MAGE; mixture modeling; mRNA quantification; multivariate data; oligonucleotides; peptoids; visualization
Degree Grantor: California Institute of Technology
Division: Chemistry and Chemical Engineering
Major Option: Chemical Engineering
Thesis Availability: Public (worldwide access)
Research Advisor(s):
  • Davis, Mark E.
Thesis Committee:
  • Davis, Mark E. (chair)
  • Asthagiri, Anand R.
  • Wold, Barbara J.
  • Tirrell, David A.
Defense Date: 21 June 2004
Record Number: CaltechETD:etd-06252004-101813
Persistent URL:
Default Usage Policy: No commercial reproduction, distribution, display or performance rights in this work are provided.
ID Code: 2723
Deposited By: Imported from ETD-db
Deposited On: 30 Jun 2004
Last Modified: 19 Apr 2021 22:38


