
Data-Driven Protein Engineering


Wu, Zachary (2021) Data-Driven Protein Engineering. Dissertation (Ph.D.), California Institute of Technology. doi:10.7907/nx3c-qb44.


Directed evolution has enabled the adaptation of natural protein sequences for a wide variety of human applications. Given a starting point (a sequence with measurable activity), directed evolution improves protein sequences by iteratively accumulating beneficial mutations. However, directed evolution requires large experimental effort, which remains the major bottleneck in efficient protein optimization. To this end, in Chapter 2 we describe a framework for incorporating machine learning into the directed evolution process to maximize the utility of the experimental data generated. In Chapter 3, we then show that this framework outperforms traditional directed evolution methods on an empirical fitness landscape. However, directed evolution is fundamentally limited by its need for a starting point, that is, a sequence with measurable activity. To tackle this issue, in Chapter 4 we test the ability of nascent deep learning techniques to generate short, functional amino acid sequences. Encouraged by this success, we attempted to generate full-length enzymatic sequences for desired substrates, but without success. However, in Chapter 5 we were able to apply this deep learning approach to model other aspects of enzymatic protein sequences. Finally, because the field of data-driven protein sequence generation is enjoying a recent surge in interest, Chapter 1 provides an updated review of protein engineering with machine learning, focusing on recent work in deep generative modeling.

Item Type: Thesis (Dissertation (Ph.D.))
Subject Keywords: Protein engineering, machine learning, directed evolution
Degree Grantor: California Institute of Technology
Division: Chemistry and Chemical Engineering
Major Option: Chemical Engineering
Thesis Availability: Public (worldwide access)
Research Advisor(s):
  • Arnold, Frances Hamilton
Thesis Committee:
  • Tirrell, David A. (chair)
  • Wang, Zhen-Gang
  • Yue, Yisong
  • Arnold, Frances Hamilton
Defense Date: 25 June 2020
Non-Caltech Author Email: zacharywu (AT)
Record Number: CaltechTHESIS:01042021-220032574
Persistent URL:
Related URLs:
  • adapted for Chapters 2 and 3.
  • adapted for Chapter 4.
  • 1 is the successor (update) to this article.
ORCID: Wu, Zachary, 0000-0003-2429-9812
Default Usage Policy: No commercial reproduction, distribution, display or performance rights in this work are provided.
ID Code: 14045
Deposited By: Zachary Wu
Deposited On: 13 Jan 2021 16:53
Last Modified: 08 Nov 2023 00:11

Thesis Files

PDF - Final Version
See Usage Policy.

