Citation
Wittmann, Bruce James (2022) Strategies and Tools for Machine Learning-Assisted Protein Engineering. Dissertation (Ph.D.), California Institute of Technology. doi:10.7907/azzt-0q97. https://resolver.caltech.edu/CaltechTHESIS:05262022-234214451
Abstract
Proteins perform critical roles in a growing list of human-devised applications, and as demands for new applications arise, new proteins must be engineered to meet them. Machine learning-assisted protein engineering (MLPE) has recently arisen as a new philosophy of protein engineering, promising to overcome many of the limitations of existing engineering strategies. Despite its promise, however, as a relatively new approach to protein engineering, MLPE faces many challenges that hinder its routine application. This thesis is focused on addressing a number of them. Chapter 1 provides a theoretical overview of protein engineering, introduces the core steps of a typical MLPE pipeline, and discusses the challenges that currently hinder MLPE’s advancement. This chapter is written to be accessible to all members of the highly multidisciplinary audience that either use or develop MLPE tools, in turn providing a resource that eliminates the steep barrier to entry that can hinder broader participation in the field. Chapter 2 provides a solution to the challenge of applying MLPE to proteins whose fitness landscapes are dominated by “holes” (protein variants with zero or extremely low fitness). Using my development of the strategy “focused training machine learning-assisted directed evolution (ftMLDE)” as an example, I demonstrate how auxiliary information from protein sequence and structure can be used to navigate landscapes despite holes, in turn dramatically improving the efficiency of MLPE. Chapter 3 explores strategies for reducing the amount of sequence-fitness data needed for building MLPE models. Specifically, I detail the motivation behind and development of a new model designed to augment limited protein sequence-fitness datasets with information extracted from raw protein sequence and structure data. 
Finally, Chapter 4 introduces “every variant sequencing” (evSeq), a collection of tools and protocols that enables extremely low-cost, routine collection of large protein sequence-fitness datasets. Not only does this technology drastically improve the financial feasibility of numerous MLPE applications, but it also potentiates the construction of a massive database of diverse protein sequence-fitness data, the likes of which would revolutionize our ability to engineer proteins with data-driven methods. Overall, the work described in this thesis advances both our understanding of MLPE and our ability to engineer proteins using it.
| Item Type: | Thesis (Dissertation (Ph.D.)) |
|---|---|
| Subject Keywords: | Machine Learning, Protein Engineering, Sequencing, Natural Language, Directed Evolution |
| Degree Grantor: | California Institute of Technology |
| Division: | Biology and Biological Engineering |
| Major Option: | Bioengineering |
| Thesis Availability: | Public (worldwide access) |
| Research Advisor(s): | |
| Thesis Committee: | |
| Defense Date: | 26 May 2022 |
| Funders: | |
| Record Number: | CaltechTHESIS:05262022-234214451 |
| Persistent URL: | https://resolver.caltech.edu/CaltechTHESIS:05262022-234214451 |
| DOI: | 10.7907/azzt-0q97 |
| Related URLs: | |
| ORCID: | |
| Default Usage Policy: | No commercial reproduction, distribution, display or performance rights in this work are provided. |
| ID Code: | 14631 |
| Collection: | CaltechTHESIS |
| Deposited By: | Bruce Wittmann |
| Deposited On: | 06 Jun 2022 17:59 |
| Last Modified: | 08 Nov 2023 00:11 |
Thesis Files
- PDF (Thesis), Final Version. See Usage Policy. 10MB
- MS Excel (Data S1.csv), Supplemental Material. See Usage Policy. 12kB
- MS Excel (Data S2.csv), Supplemental Material. See Usage Policy. 103kB
- MS Excel (Data S3.csv), Supplemental Material. See Usage Policy. 29kB