CaltechTHESIS
  A Caltech Library Service

Evaluation of the Generalizability of Machine Learning-Assisted Protein Engineering Methods

Citation

Li, Francesca-Zhoufan (2025) Evaluation of the Generalizability of Machine Learning-Assisted Protein Engineering Methods. Dissertation (Ph.D.), California Institute of Technology. doi:10.7907/yzb2-cb66. https://resolver.caltech.edu/CaltechTHESIS:04292025-041809434

Abstract

Engineered proteins can carry out a vast array of functions and have become indispensable across numerous industrial applications. To accelerate wet-lab protein engineering efforts, machine learning-based methods have advanced rapidly. However, a gap remains between state-of-the-art machine learning methods and their practical adoption. A key factor contributing to this disconnect is the lack of application-relevant benchmarking and generalizable insights across protein engineering tasks. This thesis evaluates machine learning-assisted protein engineering approaches to identify generalizable strategies. The central problem considered is learning the mapping from protein sequence to function, known as the fitness landscape, to enable prediction of the fitness of unseen variants. Chapter 1 introduces the background and context for machine learning-assisted protein engineering and highlights the practical constraint of limited experimental budgets. Chapter 2 investigates transfer learning, which leverages models pretrained on large protein sequence databases to generate informative representations for modeling task-specific sequence-function relationships. Evaluation across ten diverse tasks shows that while transfer learning is effective in structure prediction, it underperforms in variant fitness prediction, a key objective in protein engineering. Chapter 3 evaluates alternative strategies with a focus on combinatorial fitness landscapes, a common setting in protein engineering. Across 16 diverse landscapes, focused training improves the performance of various machine learning approaches by strategically selecting training variants using zero-shot predictors, which estimate variant fitness from auxiliary information without relying on experimental data. Building on these insights, Chapter 4 addresses the specific challenge of engineering enzymes (proteins that convert substrates into products) for novel chemistries. While six general zero-shot predictors without substrate information can predict enzyme activity on non-native substrates, they fail on more out-of-distribution, new-to-nature chemistries. Incorporating substrate information into zero-shot predictors leads to more generalizable performance across all tested chemistries, spanning 22 substrates. Overall, this thesis identifies generalizable strategies for machine learning-assisted protein engineering by systematically evaluating and improving how sequence-to-function relationships are modeled across diverse tasks.

Item Type: Thesis (Dissertation (Ph.D.))
Subject Keywords: protein engineering, machine learning, evaluation, generalizability, fitness prediction, protein language models, transfer learning, combinatorial mutagenesis, directed evolution, epistasis, zero-shot predictor, enzyme engineering, substrate-aware, non-native substrate, new-to-nature
Degree Grantor: California Institute of Technology
Division: Biology and Biological Engineering
Major Option: Bioengineering
Thesis Availability: Public (worldwide access)
Research Advisor(s):
  • Arnold, Frances Hamilton (advisor)
  • Yue, Yisong (advisor)
Thesis Committee:
  • Murray, Richard M. (chair)
  • Mayo, Stephen L. (co-chair)
  • Yang, Kevin K.
  • Arnold, Frances Hamilton
  • Yue, Yisong
Defense Date: 12 May 2025
Non-Caltech Author Email: francesca.zf.l (AT) berkeley.edu
Funders (Funding Agency, Grant Number):
  • Graduate Research Fellowship Program, UNSPECIFIED
  • Amazon AI4Science Fellowship, UNSPECIFIED
  • NSF Division of Chemical, Bioengineering, Environmental and Transport Systems, CBET 1937902
  • Amgen Chem-Bio-Engineering Award, CBEA AMGEN.ARNOLD22
  • U.S. Department of Energy, Office of Science, Office of Basic Energy Sciences, DE-SC0022218
Record Number: CaltechTHESIS:04292025-041809434
Persistent URL: https://resolver.caltech.edu/CaltechTHESIS:04292025-041809434
DOI: 10.7907/yzb2-cb66
Related URLs:
  • https://proceedings.mlr.press/v235/li24a.html (URL type: DOI). Article adapted for chapter 2
  • https://doi.org/10.1101/2024.10.24.619774 (URL type: DOI). Article adapted for chapter 3
  • https://openreview.net/forum?id=IqPlnXw1BJ (URL type: arXiv). Article adapted for chapter 4
ORCID:
  • Li, Francesca-Zhoufan: 0000-0002-5710-9512
Default Usage Policy: No commercial reproduction, distribution, display or performance rights in this work are provided.
ID Code: 17182
Collection: CaltechTHESIS
Deposited By: Francesca-Zhoufan Li
Deposited On: 04 Jun 2025 22:33
Last Modified: 11 Jun 2025 17:09

Thesis Files

PDF - Final Version (29 MB). See Usage Policy.
