Citation
Yang, Kevin Kaichuang (2019) Probabilistic Protein Engineering. Dissertation (Ph.D.), California Institute of Technology. doi:10.7907/ZR4M-K630. https://resolver.caltech.edu/CaltechTHESIS:12222018-173706714
Abstract
Machine learning-guided protein engineering is a new paradigm that enables the optimization of complex protein functions. Machine-learning methods use data to predict protein function without requiring a detailed model of the underlying physics or biological pathways. They accelerate protein engineering by learning from information contained in all measured variants and using it to select variants that are likely to be improved. We begin with a review of the basics of machine learning with a focus on applications to protein engineering and protein sequence-function datasets (Chapter 1). We then apply the entire machine learning-guided engineering paradigm to channelrhodopsin (ChR), an algal-derived light-gated channel that can be used to modulate neuronal activity with light. We build models that discover ChRs with strong plasma membrane localization in mammalian cells (Chapter 2) and unprecedented light sensitivity and photocurrents for optogenetic applications (Chapter 3). Machine learning-guided evolution requires a machine-learning model that learns the relationship between sequence and function. For machine-learning models to learn about protein sequences, protein sequences must be represented as vectors or matrices of numbers. How each protein sequence is represented determines what can be learned. We learn continuous vector encodings of sequences from patterns in unlabeled sequences (Chapter 4). Learned encodings are low-dimensional, do not require alignments, and may improve performance by transferring information in unlabeled sequences to specific prediction tasks. Alternatively, we demonstrate an interpretable Gaussian process kernel tailored to biological sequences (Chapter 6). In addition to a model to predict function from sequence, engineering requires a method to use the model to choose sequences for the next round of evolution. Most machine learning-guided engineering strategies assume that selected sequences can be queried directly. However, in directed evolution it is common to design a library of sequences and then sample stochastic batches from that library. We propose a batched stochastic Bayesian optimization algorithm for iteratively designing and screening site-saturation mutagenesis libraries (Chapter 5).
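As a minimal illustration of the statement that protein sequences must be represented as vectors or matrices of numbers, the sketch below builds a one-hot encoding with NumPy. One-hot encoding is a common baseline and is used here as an assumption for illustration only; it is not the learned encoding described in Chapter 4. The function name `one_hot_encode` and the toy sequences are hypothetical and do not come from the thesis.

```python
import numpy as np

# The 20 canonical amino acids, in an arbitrary but fixed order.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def one_hot_encode(sequence):
    """Return a (len(sequence), 20) matrix with a single 1 per row."""
    encoding = np.zeros((len(sequence), len(AMINO_ACIDS)))
    for position, residue in enumerate(sequence):
        encoding[position, AA_INDEX[residue]] = 1.0
    return encoding


# Hypothetical toy variants of equal length (not taken from the thesis).
variants = ["MKTA", "MKSA"]

# Flatten each (L, 20) matrix into one feature vector per variant.
X = np.stack([one_hot_encode(seq).ravel() for seq in variants])
```

Each row of `X` can then be passed to a regression model. Unlike the learned encodings discussed above, this representation grows with sequence length and requires sequences of equal length (or an alignment).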
Item Type: | Thesis (Dissertation (Ph.D.))
---|---
Subject Keywords: | Protein engineering, machine learning
Degree Grantor: | California Institute of Technology
Division: | Chemistry and Chemical Engineering
Major Option: | Chemical Engineering
Thesis Availability: | Public (worldwide access)
Group: | Rosen Bioengineering Center
Defense Date: | 14 December 2018
Non-Caltech Author Email: | yangkky (AT) gmail.com
Record Number: | CaltechTHESIS:12222018-173706714
Persistent URL: | https://resolver.caltech.edu/CaltechTHESIS:12222018-173706714
DOI: | 10.7907/ZR4M-K630
Default Usage Policy: | No commercial reproduction, distribution, display or performance rights in this work are provided.
ID Code: | 11326
Collection: | CaltechTHESIS
Deposited By: | Kevin Yang
Deposited On: | 08 Jan 2019 20:54
Last Modified: | 08 Nov 2023 00:11
Thesis Files
PDF (Dissertation), Final Version, 18MB. See Usage Policy.