
Probabilistic Protein Engineering


Yang, Kevin Kaichuang (2019) Probabilistic Protein Engineering. Dissertation (Ph.D.), California Institute of Technology. doi:10.7907/ZR4M-K630.


Machine learning-guided protein engineering is a new paradigm that enables the optimization of complex protein functions. Machine-learning methods use data to predict protein function without requiring a detailed model of the underlying physics or biological pathways. They accelerate protein engineering by learning from the information contained in all measured variants and using it to select variants that are likely to be improved. We begin with a review of the basics of machine learning, with a focus on applications to protein engineering and protein sequence-function datasets (Chapter 1). We then use the entire machine learning-guided engineering paradigm to engineer the algal-derived light-gated channel channelrhodopsin (ChR), which can be used to modulate neuronal activity with light. We build models that discover ChRs with strong plasma membrane localization in mammalian cells (Chapter 2) and unprecedented light sensitivity and photocurrents for optogenetic applications (Chapter 3). Machine learning-guided evolution requires a machine-learning model that learns the relationship between sequence and function. For machine-learning models to learn about protein sequences, the sequences must be represented as vectors or matrices of numbers. How each protein sequence is represented determines what can be learned. We learn continuous vector encodings of sequences from patterns in unlabeled sequences (Chapter 4). Learned encodings are low-dimensional, do not require alignments, and may improve performance by transferring information in unlabeled sequences to specific prediction tasks. Alternatively, we demonstrate an interpretable Gaussian process kernel tailored to biological sequences (Chapter 6). In addition to a model that predicts function from sequence, engineering requires a method that uses the model to choose sequences for the next round of evolution. Most machine learning-guided engineering strategies assume that selected sequences can be queried directly. However, in directed evolution it is common to design a library of sequences and then sample stochastic batches from that library. We propose a batched stochastic Bayesian optimization algorithm for iteratively designing and screening site-saturation mutagenesis libraries (Chapter 5).
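As a small illustration of the statement that sequences must be represented as vectors or matrices of numbers, the sketch below shows a one-hot encoding, a common baseline representation. It is not the learned continuous encoding described for Chapter 4, and the names `AMINO_ACIDS` and `one_hot_encode` are hypothetical, chosen only for this example.

```python
# Minimal sketch: one-hot encoding of a protein sequence.
# Assumption: only the 20 canonical amino acids appear; the column
# ordering below is an arbitrary choice made for this example.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def one_hot_encode(sequence):
    """Return a len(sequence) x 20 matrix (list of lists) with one 1.0 per row."""
    matrix = []
    for aa in sequence:
        if aa not in AA_INDEX:
            raise ValueError(f"non-canonical amino acid: {aa!r}")
        row = [0.0] * len(AMINO_ACIDS)
        row[AA_INDEX[aa]] = 1.0  # mark the column for this residue
        matrix.append(row)
    return matrix


encoded = one_hot_encode("MKT")
print(len(encoded), len(encoded[0]))  # 3 20
```

A one-hot matrix grows with sequence length and requires aligned sequences of equal length to be compared position by position, which is the kind of limitation the abstract contrasts with low-dimensional, alignment-free learned encodings.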

Item Type: Thesis (Dissertation (Ph.D.))
Subject Keywords: Protein engineering, machine learning
Degree Grantor: California Institute of Technology
Division: Chemistry and Chemical Engineering
Major Option: Chemical Engineering
Thesis Availability: Public (worldwide access)
Research Advisor(s):
  • Arnold, Frances Hamilton
Group: Rosen Bioengineering Center
Thesis Committee:
  • Clemons, William M. (chair)
  • Yue, Yisong
  • Tirrell, David A.
  • Arnold, Frances Hamilton
Defense Date: 14 December 2018
Non-Caltech Author Email: yangkky (AT)
Record Number: CaltechTHESIS:12222018-173706714
Persistent URL:
Related URLs:
URL | URL Type | Description
https://10.1093/bioinformatics/bty178 | DOI | Article adapted for chapter 4.
https://10.1371/journal.pcbi.1005786 | DOI | Article adapted for chapter 2. adapted for chapter 1.
ORCID: Yang, Kevin Kaichuang, 0000-0001-9045-6826
Default Usage Policy: No commercial reproduction, distribution, display or performance rights in this work are provided.
ID Code: 11326
Deposited By: Kevin Yang
Deposited On: 08 Jan 2019 20:54
Last Modified: 08 Nov 2023 00:11

Thesis Files

PDF (Dissertation) - Final Version
See Usage Policy.

