Probabilistic Protein Engineering

Citation

Yang, Kevin Kaichuang (2019) Probabilistic Protein Engineering. Dissertation (Ph.D.), California Institute of Technology. doi:10.7907/ZR4M-K630. https://resolver.caltech.edu/CaltechTHESIS:12222018-173706714

Abstract

Machine learning-guided protein engineering is a new paradigm that enables the optimization of complex protein functions. Machine-learning methods use data to predict protein function without requiring a detailed model of the underlying physics or biological pathways. They accelerate protein engineering by learning from the information contained in all measured variants and using it to select variants that are likely to be improved. We begin with a review of the basics of machine learning, with a focus on applications to protein engineering and protein sequence-function datasets (Chapter 1). We then use the entire machine learning-guided engineering paradigm to engineer the algal-derived light-gated channel channelrhodopsin (ChR), which can be used to modulate neuronal activity with light. We build models that discover ChRs with strong plasma membrane localization in mammalian cells (Chapter 2) and with unprecedented light sensitivity and photocurrents for optogenetic applications (Chapter 3). Machine learning-guided evolution requires a machine-learning model that learns the relationship between sequence and function. For machine-learning models to learn about protein sequences, the sequences must be represented as vectors or matrices of numbers. How each protein sequence is represented determines what can be learned. We learn continuous vector encodings of sequences from patterns in unlabeled sequences (Chapter 4). Learned encodings are low-dimensional, do not require alignments, and may improve performance by transferring information in unlabeled sequences to specific prediction tasks. Alternatively, we demonstrate an interpretable Gaussian process kernel tailored to biological sequences (Chapter 6). In addition to a model to predict function from sequence, engineering requires a method to use the model to choose sequences for the next round of evolution. Most machine learning-guided engineering strategies assume that selected sequences can be queried directly. However, in directed evolution it is common to design a library of sequences and then sample stochastic batches from that library. We propose a batched stochastic Bayesian optimization algorithm for iteratively designing and screening site-saturation mutagenesis libraries (Chapter 5).
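
As an illustration of what representing a protein sequence as a matrix of numbers can mean, the sketch below shows one-hot encoding, which is a common baseline. It is not the learned continuous encoding described for Chapter 4 and is not taken from the thesis. The names `AMINO_ACIDS`, `AA_INDEX`, and `one_hot_encode`, the alphabet ordering, and the example fragment are all assumptions made for this illustration.

```python
# Minimal one-hot encoding sketch (illustrative only, not the thesis's method).
# The 20 standard amino acids in one-letter code; the ordering is arbitrary.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def one_hot_encode(sequence):
    """Return a list of rows, one per residue, each with a single 1 among 20 entries."""
    encoded = []
    for residue in sequence:
        row = [0] * len(AMINO_ACIDS)
        # Raises KeyError for characters outside the 20-letter alphabet.
        row[AA_INDEX[residue]] = 1
        encoded.append(row)
    return encoded


# Hypothetical five-residue fragment, not a sequence from the thesis.
example = one_hot_encode("MKTAY")
print(len(example), len(example[0]))  # 5 20
```

A one-hot matrix grows with sequence length and treats every pair of amino acids as equally dissimilar. Low-dimensional learned encodings, as described in the abstract, are one way to address both limitations.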

Item Type:

Thesis (Dissertation (Ph.D.))

Subject Keywords:

Protein engineering, machine learning

Degree Grantor:

California Institute of Technology

Division:

Chemistry and Chemical Engineering

Major Option:

Chemical Engineering

Thesis Availability:

Public (worldwide access)

Research Advisor(s):

Arnold, Frances Hamilton

Group:

Rosen Bioengineering Center

Thesis Committee:

Clemons, William M. (chair)
Yue, Yisong
Tirrell, David A.
Arnold, Frances Hamilton

Defense Date:

14 December 2018

Non-Caltech Author Email:

yangkky (AT) gmail.com

Record Number:

CaltechTHESIS:12222018-173706714

Persistent URL:

https://resolver.caltech.edu/CaltechTHESIS:12222018-173706714

DOI:

10.7907/ZR4M-K630

Related URLs:

URL	URL Type	Description
https://doi.org/10.1093/bioinformatics/bty178	DOI	Article adapted for Chapter 4.
https://doi.org/10.1371/journal.pcbi.1005786	DOI	Article adapted for Chapter 2.
https://arxiv.org/abs/1811.10775	arXiv	Article adapted for Chapter 1.

ORCID:

Author	ORCID
Yang, Kevin Kaichuang	0000-0001-9045-6826

Default Usage Policy:

No commercial reproduction, distribution, display or performance rights in this work are provided.

ID Code:

11326

Collection:

CaltechTHESIS

Deposited By:

Kevin Yang

Deposited On:

08 Jan 2019 20:54

Last Modified:

08 Nov 2023 00:11

Thesis Files

PDF (Dissertation) - Final Version
See Usage Policy.
18MB