
Machine Learning and Modeling Methods for Protein Engineering


Aceves, Aiden Joseph (2022) Machine Learning and Modeling Methods for Protein Engineering. Dissertation (Ph.D.), California Institute of Technology. doi:10.7907/5j34-0d55.


Computation has been an integral part of structural biology ever since the first protein structure was solved via Fourier synthesis on the EDSAC Mark I electronic computer in 1958 (Kendrew et al., 1958). Throughout my time at Caltech, I have worked to develop new methods for applying machine learning and molecular modeling to the study of biological macromolecules. These efforts have followed two distinct tracks, unified by a focus on studying proteins at the structural level.

Through molecular dynamics and modeling, I have studied insulin from several angles, including the incorporation of non-canonical amino acids and how these substitutions might alter critical properties such as hexamer dissociation and fibril formation. Additionally, I have probed how insulin behaves at the interface of water and silica, a property that is critical for the effective distribution and administration of this therapeutic molecule. I have also helped to develop a novel computationally guided workflow for integrating drug conjugates into antibody complementarity-determining regions (CDRs). This technique yields molecules that exhibit synergistic binding and an enhanced capacity for selective binding.

The second major thrust of my research has focused on applying machine learning to protein engineering problems, particularly developing tools for working with structural data and for efficiently reusing data that other groups have already laboriously collected. The basic data parsing and processing tools created and refined over the course of my time at Caltech have enabled many other projects, both my own and those of collaborators. Studies of generative networks for protein-protein docking have yielded useful insights into network architecture, the inclusion of intermediate learning objectives, and overcoming sparsity. The technique introduced in our ICLR 2021 paper is a regularization method that enables data from past protein engineering campaigns to be leveraged to learn policies that optimally select molecules to synthesize in unrelated engineering efforts, potentially saving significant time and money in future projects.


Kendrew, J. C.; Bodo, G.; Dintzis, H. M.; Parrish, R. G.; Wyckoff, H.; Phillips, D. C. "A Three-Dimensional Model of the Myoglobin Molecule Obtained by X-Ray Analysis". Nature 1958, 181 (4610), 662–666.

Item Type: Thesis (Dissertation (Ph.D.))
Subject Keywords: Machine learning, artificial intelligence, ai, protein, engineering, insulin, docking, crystallography, surface adhesion, glass
Degree Grantor: California Institute of Technology
Division: Biology and Biological Engineering
Major Option: Bioengineering
Thesis Availability: Public (worldwide access)
Research Advisor(s):
  • Mayo, Stephen L.
Thesis Committee:
  • Bjorkman, Pamela J. (chair)
  • Miller, Thomas F.
  • Yue, Yisong
  • Van Valen, David A.
  • Mayo, Stephen L.
Defense Date: 18 August 2021
Non-Caltech Author Email: ajaceves (AT)
Record Number: CaltechTHESIS:09092021-024951318
Persistent URL:
Related URLs:
  • 2021 paper adapted as section 1.5
Default Usage Policy: No commercial reproduction, distribution, display or performance rights in this work are provided.
ID Code: 14357
Deposited By: Aiden Aceves
Deposited On: 28 Sep 2021 16:12
Last Modified: 21 Jun 2022 22:50


