Citation
Li, Francesca-Zhoufan (2025) Evaluation of the Generalizability of Machine Learning-Assisted Protein Engineering Methods. Dissertation (Ph.D.), California Institute of Technology. doi:10.7907/yzb2-cb66. https://resolver.caltech.edu/CaltechTHESIS:04292025-041809434
Abstract
Engineered proteins can carry out a vast array of functions and have become indispensable across numerous industrial applications. To accelerate wet-lab protein engineering efforts, machine learning-based methods have advanced rapidly. However, a gap remains between state-of-the-art machine learning methods and their practical adoption. A key factor contributing to this disconnect is the lack of application-relevant benchmarking and generalizable insights across protein engineering tasks. This thesis evaluates machine learning-assisted protein engineering approaches to identify generalizable strategies. The central problem considered is learning the mapping from protein sequence to function, known as the fitness landscape, to enable the prediction of unseen variant fitness. Chapter 1 introduces the background and context for machine learning-assisted protein engineering and highlights the practical constraint of limited experimental budgets. Chapter 2 investigates transfer learning, which leverages models pretrained on large protein sequence databases to generate informative representations for modeling task-specific sequence-function relationships. Evaluation across ten diverse tasks shows that while transfer learning is effective in structure prediction, it underperforms in variant fitness prediction, a key objective in protein engineering. Chapter 3 evaluates alternative strategies with a focus on combinatorial fitness landscapes, a common setting in protein engineering. Across 16 diverse landscapes, focused training improves the performance of various machine learning approaches by strategically selecting training variants using zero-shot predictors, which estimate variant fitness from auxiliary information without relying on experimental data. Building on these insights, Chapter 4 addresses the specific challenge of engineering enzymes, proteins that convert substrates into products, for novel chemistries.
While six general zero-shot predictors without substrate information can predict enzyme activity on non-native substrates, they fail on more out-of-distribution, new-to-nature chemistries. Incorporating substrate information into zero-shot predictors leads to more generalizable performance across all tested chemistries, spanning 22 substrates. Overall, this thesis identifies generalizable strategies for machine learning-assisted protein engineering by systematically evaluating and improving how sequence-to-function relationships are modeled across diverse tasks.
Item Type: | Thesis (Dissertation (Ph.D.))
---|---
Subject Keywords: | protein engineering, machine learning, evaluation, generalizability, fitness prediction, protein language models, transfer learning, combinatorial mutagenesis, directed evolution, epistasis, zero-shot predictor, enzyme engineering, substrate-aware, non-native substrate, new-to-nature
Degree Grantor: | California Institute of Technology
Division: | Biology and Biological Engineering
Major Option: | Bioengineering
Thesis Availability: | Public (worldwide access)
Research Advisor(s): |
Thesis Committee: |
Defense Date: | 12 May 2025
Non-Caltech Author Email: | francesca.zf.l (AT) berkeley.edu
Funders: |
Record Number: | CaltechTHESIS:04292025-041809434
Persistent URL: | https://resolver.caltech.edu/CaltechTHESIS:04292025-041809434
DOI: | 10.7907/yzb2-cb66
Related URLs: |
ORCID: |
Default Usage Policy: | No commercial reproduction, distribution, display or performance rights in this work are provided.
ID Code: | 17182
Collection: | CaltechTHESIS
Deposited By: | Francesca-Zhoufan Li
Deposited On: | 04 Jun 2025 22:33
Last Modified: | 11 Jun 2025 17:09
Thesis Files
PDF - Final Version (29MB). See Usage Policy.