
Acquiring Enzyme Sequence-Fitness Data at Scale Toward Predictive Methods for Enzyme Engineering


Johnston, Kadina Elizabeth (2024) Acquiring Enzyme Sequence-Fitness Data at Scale Toward Predictive Methods for Enzyme Engineering. Dissertation (Ph.D.), California Institute of Technology. doi:10.7907/xjz4-k217.


The emergence of machine learning methods for expediting directed evolution via protein fitness prediction has recently shed light on the need for more high-quality sequence-fitness data from which to learn the mapping from sequence to fitness. Enzymes in particular are highly selective catalysts, and engineered enzymes are becoming increasingly important for human applications such as pharmaceutical synthesis. This thesis thus focuses on the collection of enzymatic sequence-fitness data to enable both development and validation of emerging approaches. Chapter 1 describes the process of traditional directed evolution as well as ways that machine learning methods have been used to accelerate it. It also discusses the experimental considerations for applying machine learning to the various steps of protein engineering campaigns, as the experimental constraints are not always obvious to the machine learning community. One of the major constraints on the application of machine learning methods is the need to sequence all variants used for model training, a step that traditional, lab-only directed evolution often skips because it is not worth the time and cost. Chapter 2 introduces a solution to this problem with “every variant sequencing” (evSeq), which enables higher-throughput collection of sequencing data at a time and cost similar to those of commonly used Sanger sequencing methods. This method not only enables implementation of ML methods such as machine learning-assisted directed evolution (MLDE) and focused training MLDE (ftMLDE) by sequencing variants during an evolution campaign, but also offers promise to fill existing protein sequence-fitness databases with protein engineering datasets.
This type of data collection can enable the development of newer, more accurate ML methods, and was an inspiration for the work presented in Chapter 3, which details the collection of a combinatorially complete, epistatic sequence-fitness landscape in an enzyme active site. Often, the effects of mutations on protein fitness can be considered largely independent, and recombining those mutations in the laboratory can find an optimal variant. This general principle breaks down when the effects of mutations are not independent, a phenomenon termed epistasis, and sequence-fitness landscapes with these interactions are difficult to traverse. Thus, collection of this dataset provides a challenging task for the development of both ML and physics-based models and pushes the boundary of predictive methods for protein engineering.

Item Type:Thesis (Dissertation (Ph.D.))
Subject Keywords:enzyme engineering; machine learning; fitness landscape; tryptophan synthase; sequencing;
Degree Grantor:California Institute of Technology
Division:Biology and Biological Engineering
Major Option:Bioengineering
Thesis Availability:Public (worldwide access)
Research Advisor(s):
  • Arnold, Frances Hamilton
Thesis Committee:
  • Yue, Yisong (chair)
  • Mayo, Stephen L.
  • Bois, Justin S.
  • Arnold, Frances Hamilton
Defense Date:18 September 2023
Funding Agency (Grant Number):
  • Amazon AI4Science Fellowship (UNSPECIFIED)
  • Biotechnology Leadership Program (NIH) (UNSPECIFIED)
  • Amgen Chem-Bio-Engineering Award (CBEA)
  • NSF Division of Chemical, Bioengineering, Environmental and Transport Systems (CBET 1937902)
  • U.S. Department of Energy, Office of Science, Office of Basic Energy Sciences (DE-SC0022218)
  • Caltech Center for Environmental Sciences (UNSPECIFIED)
Record Number:CaltechTHESIS:09212023-182422125
Persistent URL:
Related URLs:
  • adapted for Chapter 1
  • of second article adapted for Chapter 1
  • adapted for Chapter 2
ORCID:
  • Johnston, Kadina Elizabeth (0000-0002-2214-3534)
Default Usage Policy:No commercial reproduction, distribution, display or performance rights in this work are provided.
ID Code:16185
Deposited By: Kadina Johnston
Deposited On:25 Sep 2023 21:24
Last Modified:08 May 2024 18:38

Thesis Files

PDF - Final Version
See Usage Policy.

