Acquiring Enzyme Sequence-Fitness Data at Scale Toward Predictive Methods for Enzyme Engineering

Citation

Johnston, Kadina Elizabeth (2024) Acquiring Enzyme Sequence-Fitness Data at Scale Toward Predictive Methods for Enzyme Engineering. Dissertation (Ph.D.), California Institute of Technology. doi:10.7907/xjz4-k217. https://resolver.caltech.edu/CaltechTHESIS:09212023-182422125

Abstract

The emergence of machine learning methods for expediting directed evolution via protein fitness prediction has recently shed light on the need for more high-quality sequence-fitness data from which to learn the mapping from sequence to fitness. Enzymes in particular are highly selective catalysts, and engineered enzymes are becoming increasingly important for human applications such as pharmaceutical synthesis. This thesis thus focuses on the collection of enzymatic sequence-fitness data to enable both development and validation of emerging approaches. Chapter 1 describes the process of traditional directed evolution as well as ways that machine learning methods have been used to accelerate it. It also discusses the experimental considerations for applying machine learning to the various steps of protein engineering campaigns, as the experimental constraints are not always obvious to the machine learning community. One of the major constraints on the application of machine learning methods is the need to sequence all variants used for model training, a step that traditional, lab-only directed evolution often skips because it is not worth the time and cost. Chapter 2 introduces a solution to this problem with “every variant sequencing” (evSeq), which enables higher-throughput collection of sequencing data at a time and cost similar to those of commonly used Sanger sequencing methods. This method not only enables implementation of ML methods such as machine learning-assisted directed evolution (MLDE) and focused training MLDE (ftMLDE) by sequencing variants during an evolution campaign, but also offers the promise of filling existing protein sequence-fitness databases with protein engineering datasets.
This type of data collection can enable the development of newer, more accurate ML methods, and was an inspiration for the work presented in Chapter 3, which details the collection of a combinatorially complete, epistatic sequence-fitness landscape in an enzyme active site. Often, the effects of mutations on protein fitness can be considered largely independent, and recombining them in the laboratory can find an optimal variant. This general principle breaks down when the effects of mutations are not independent, a phenomenon termed epistasis, and sequence-fitness landscapes with these interactions are difficult to traverse. Thus, collection of this dataset provides a challenging task for the development of both ML and physics-based models and pushes the boundary of predictive methods for protein engineering.

Item Type:

Thesis (Dissertation (Ph.D.))

Subject Keywords:

enzyme engineering; machine learning; fitness landscape; tryptophan synthase; sequencing

Degree Grantor:

California Institute of Technology

Division:

Biology and Biological Engineering

Major Option:

Bioengineering

Thesis Availability:

Public (worldwide access)

Research Advisor(s):

Arnold, Frances Hamilton

Thesis Committee:

Yue, Yisong (chair)
Mayo, Stephen L.
Bois, Justin S.
Arnold, Frances Hamilton

Defense Date:

18 September 2023

Funders:

Funding Agency	Grant Number
Amazon AI4Science Fellowship	UNSPECIFIED
Biotechnology Leadership Program (NIH)	UNSPECIFIED
Amgen Chem-Bio-Engineering Award	CBEA
NSF Division of Chemical, Bioengineering, Environmental and Transport Systems	CBET 1937902
U.S. Department of Energy, Office of Science, Office of Basic Energy Sciences	DE-SC0022218
Caltech Center for Environmental Sciences	UNSPECIFIED

Record Number:

CaltechTHESIS:09212023-182422125

Persistent URL:

https://resolver.caltech.edu/CaltechTHESIS:09212023-182422125

DOI:

10.7907/xjz4-k217

Related URLs:

URL	URL Type	Description
https://doi.org/10.1016/j.sbi.2021.01.008	DOI	Article adapted for Chapter 1
https://doi.org/10.48550/arXiv.2305.16634	DOI	Preprint of second article adapted for Chapter 1
https://doi.org/10.1021/acssynbio.1c00592	DOI	Article adapted for Chapter 2

ORCID:

Author	ORCID
Johnston, Kadina Elizabeth	0000-0002-2214-3534

Default Usage Policy:

No commercial reproduction, distribution, display or performance rights in this work are provided.

ID Code:

16185

Collection:

CaltechTHESIS

Deposited By:

Kadina Johnston

Deposited On:

25 Sep 2023 21:24

Last Modified:

08 May 2024 18:38

Thesis Files

PDF - Final Version
See Usage Policy.
21MB
