
Strategies and Tools for Machine Learning-Assisted Protein Engineering


Wittmann, Bruce James (2022) Strategies and Tools for Machine Learning-Assisted Protein Engineering. Dissertation (Ph.D.), California Institute of Technology. doi:10.7907/azzt-0q97.


Proteins perform critical roles in a growing list of human-devised applications, and as demands for new applications arise, new proteins must be engineered to meet them. Machine learning-assisted protein engineering (MLPE) has recently arisen as a new philosophy of protein engineering, promising to overcome many of the limitations of existing engineering strategies. Despite its promise, however, as a relatively new approach to protein engineering, MLPE faces many challenges that hinder its routine application. This thesis is focused on addressing a number of them. Chapter 1 provides a theoretical overview of protein engineering, introduces the core steps of a typical MLPE pipeline, and discusses the challenges that currently hinder MLPE’s advancement. This chapter is written to be accessible to all members of the highly multidisciplinary audience that either use or develop MLPE tools, in turn providing a resource that eliminates the steep barrier to entry that can hinder broader participation in the field. Chapter 2 provides a solution to the challenge of applying MLPE to proteins whose fitness landscapes are dominated by “holes” (protein variants with zero or extremely low fitness). Using my development of the strategy “focused training machine learning-assisted directed evolution (ftMLDE)” as an example, I demonstrate how auxiliary information from protein sequence and structure can be used to navigate landscapes despite holes, in turn dramatically improving the efficiency of MLPE. Chapter 3 explores strategies for reducing the amount of sequence-fitness data needed for building MLPE models. Specifically, I detail the motivation behind and development of a new model designed to augment limited protein sequence-fitness datasets with information extracted from raw protein sequence and structure data. 
Finally, chapter 4 introduces “every variant sequencing” (evSeq), a collection of tools and protocols that enables extremely low-cost, routine collection of large protein sequence-fitness datasets. Not only does this technology drastically improve the financial feasibility of numerous MLPE applications, but it also potentiates the construction of a massive database of diverse protein sequence-fitness data, the likes of which would revolutionize our ability to engineer proteins with data-driven methods. Overall, the work described in this thesis advances both our understanding of MLPE and our ability to engineer proteins using it.

Item Type: Thesis (Dissertation (Ph.D.))
Subject Keywords: Machine Learning, Protein Engineering, Sequencing, Natural Language, Directed Evolution
Degree Grantor: California Institute of Technology
Division: Biology and Biological Engineering
Major Option: Bioengineering
Thesis Availability: Public (worldwide access)
Research Advisor(s):
  • Arnold, Frances Hamilton
Thesis Committee:
  • Pachter, Lior S. (chair)
  • Reisman, Sarah E.
  • Mayo, Stephen L.
  • Arnold, Frances Hamilton
Defense Date: 26 May 2022
Funders (Funding Agency: Grant Number):
  • NSF Division of Chemical, Bioengineering, Environmental and Transport Systems: CBET 1937902
  • Amgen Chem-Bio-Engineering Award: CBEA
  • U.S. Department of Energy, Office of Science, Office of Basic Energy Sciences: DE-SC0022218
  • Camille and Henry Dreyfus Foundation: ML-20-194
  • Caltech Carver Mead New Adventure Seed Fund: UNSPECIFIED
Record Number: CaltechTHESIS:05262022-234214451
Persistent URL:
Related URLs:
  • Adapted for chapters 1 and 3
  • Adapted for chapter 2
  • Adapted for chapter 4
  • Published work not included in thesis
  • Published work not included in thesis
  • Published work not included in thesis
ORCID:
  • Wittmann, Bruce James: 0000-0001-8144-9157
Default Usage Policy: No commercial reproduction, distribution, display or performance rights in this work are provided.
ID Code: 14631
Deposited By: Bruce Wittmann
Deposited On: 06 Jun 2022 17:59
Last Modified: 08 Nov 2023 00:11

Thesis Files

  • PDF (Thesis) - Final Version
  • MS Excel (Data S1.csv) - Supplemental Material
  • MS Excel (Data S2.csv) - Supplemental Material
  • MS Excel (Data S3.csv) - Supplemental Material

See Usage Policy.

