Citation
Johnston, Kadina Elizabeth (2024) Acquiring Enzyme Sequence-Fitness Data at Scale Toward Predictive Methods for Enzyme Engineering. Dissertation (Ph.D.), California Institute of Technology. doi:10.7907/xjz4-k217. https://resolver.caltech.edu/CaltechTHESIS:09212023-182422125
Abstract
The emergence of machine learning (ML) methods that expedite directed evolution through protein fitness prediction has highlighted the need for more high-quality sequence-fitness data from which to learn the mapping from sequence to fitness. Enzymes in particular are highly selective catalysts, and engineered enzymes are becoming increasingly important for human applications such as pharmaceutical synthesis. This thesis therefore focuses on collecting enzymatic sequence-fitness data to enable both the development and the validation of emerging approaches. Chapter 1 describes the process of traditional directed evolution and the ways machine learning methods have been used to accelerate it. It also discusses the experimental considerations for applying machine learning to the various steps of protein engineering campaigns, as the experimental constraints are not always obvious to the machine learning community. One major constraint on the application of machine learning methods is the need to sequence every variant used for model training, a step that traditional, lab-only directed evolution often skips because it is not worth the time and cost. Chapter 2 introduces a solution to this problem, “every variant sequencing” (evSeq), which enables higher-throughput collection of sequencing data at a time and cost similar to those of commonly used Sanger sequencing methods. By sequencing variants during an evolution campaign, this method not only enables implementation of ML methods such as machine learning-assisted directed evolution (MLDE) and focused training MLDE (ftMLDE), but also promises to fill existing protein sequence-fitness databases with protein engineering datasets.
This type of data collection can enable the development of newer, more accurate ML methods, and it inspired the work presented in Chapter 3, which details the collection of a combinatorially complete, epistatic sequence-fitness landscape in an enzyme active site. Often, the effects of mutations on protein fitness can be considered largely independent, and recombining them in the laboratory can find an optimal variant. This general principle breaks down when the effects of mutations are not independent, a phenomenon termed epistasis, and sequence-fitness landscapes with these interactions are difficult to traverse. Thus, this dataset provides a challenging task for the development of both ML and physics-based models and pushes the boundary of predictive methods for protein engineering.
Item Type: | Thesis (Dissertation (Ph.D.))
---|---
Subject Keywords: | enzyme engineering; machine learning; fitness landscape; tryptophan synthase; sequencing
Degree Grantor: | California Institute of Technology
Division: | Biology and Biological Engineering
Major Option: | Bioengineering
Thesis Availability: | Public (worldwide access)
Research Advisor(s): |
Thesis Committee: |
Defense Date: | 18 September 2023
Funders: |
Record Number: | CaltechTHESIS:09212023-182422125
Persistent URL: | https://resolver.caltech.edu/CaltechTHESIS:09212023-182422125
DOI: | 10.7907/xjz4-k217
Related URLs: |
ORCID: |
Default Usage Policy: | No commercial reproduction, distribution, display or performance rights in this work are provided.
ID Code: | 16185
Collection: | CaltechTHESIS
Deposited By: | Kadina Johnston
Deposited On: | 25 Sep 2023 21:24
Last Modified: | 08 May 2024 18:38
Thesis Files
PDF - Final Version (21MB). See Usage Policy.