AI and Computational Design Advance Protein Engineering
Therapeutic protein design is evolving—and it is doing so in more than one sense of that word. Protein design is being guided by artificial intelligence (AI), which drug developers are using to systematically exploit the complex physical mechanisms behind macromolecule formation in nature. Indeed, drug developers anticipate that AI technology will help them create safer, more effective medicines.
The link between protein structure and function has been known for decades [1]. The challenge for drug designers was that, until recently, the specific “rules” governing how amino acid chains fold into three-dimensional structures were poorly understood.
However, in the past few years, computational tools have allowed researchers to predict how proteins fold with far greater accuracy. This development animates the work of researchers at protein design organizations and companies such as Evozyne. “Classic approaches to molecular engineering are either structure-based or based on random variation,” says Rama Ranganathan, MD, PhD, Evozyne’s co-founder and chief scientific officer. “A machine learning approach learns the entire evolutionary history of proteins to distill the underlying design principles. This allows us to engineer new proteins with a high probability of meeting even complex multifactorial design goals.”
Ranganathan believes that machine learning, an application of AI that uses mathematical models of data to help a computer learn independently, is reshaping drug design: “First, it produces not just one solution to a problem, but a large library of solutions. That enables secondary screens for additional properties such as immunogenicity, expression in particular cell types, and other idiosyncratic properties that we cannot rationally predict.
“Second, the approach does not require structure-based information or experimental measurements to start. This removes biases due to our intuitions about protein mechanism and due to doing experimental work in specific laboratory conditions.
“Third, the new process enables simultaneous optimization of multiple design goals such as catalytic power, stability, and expression. This is important because trade-offs in optimizing these kinds of properties have led to failures with more conventional methods.”
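One way to picture the trade-offs Ranganathan describes is a Pareto filter: from a library of candidates, keep every design that no other design beats on all goals at once. The sketch below is only an illustration of that general idea, not Evozyne's method. The candidate names and the three scores are made-up values, and the `Candidate`, `dominates`, and `pareto_front` names are hypothetical.

```python
from typing import NamedTuple


class Candidate(NamedTuple):
    name: str
    activity: float    # hypothetical catalytic activity score, higher is better
    stability: float   # hypothetical stability score, higher is better
    expression: float  # hypothetical expression score, higher is better


def dominates(a: Candidate, b: Candidate) -> bool:
    """True if a is at least as good as b on every goal and better on at least one."""
    a_scores, b_scores = a[1:], b[1:]
    at_least_as_good = all(x >= y for x, y in zip(a_scores, b_scores))
    strictly_better = any(x > y for x, y in zip(a_scores, b_scores))
    return at_least_as_good and strictly_better


def pareto_front(library: list[Candidate]) -> list[Candidate]:
    """Keep the candidates that no other candidate dominates."""
    return [c for c in library if not any(dominates(o, c) for o in library)]


# Made-up library for illustration only.
library = [
    Candidate("v1", 0.90, 0.40, 0.70),
    Candidate("v2", 0.60, 0.80, 0.65),
    Candidate("v3", 0.55, 0.75, 0.60),  # worse than v2 on all three goals
    Candidate("v4", 0.85, 0.35, 0.95),
]
front = pareto_front(library)
print([c.name for c in front])
```

Here v3 drops out because v2 is better on every score, while v1, v2, and v4 each win on at least one goal and so remain as alternatives for secondary screening.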
Ranganathan sums up his case as follows: “Basically, by letting evolution tell you through the models how to engineer proteins, we go far beyond the capabilities of past approaches. We think this is the approach to better cell-specific therapeutics, and to antibody treatments and vaccines that are less prone to the onset of resistance.”
The point, in Ranganathan’s view, is that the biopharmaceutical industry finally has the tools and technologies needed to use AI in commercial drug development. He emphasizes that evolution-based data-driven molecular engineering requires four things: advanced bioinformatics tools for acquiring and curating the input data; powerful deep learning algorithms for learning the rules; DNA synthesis capabilities that are fast, accurate, and cheap; and very high-throughput, high-quality functional assays to enable model retraining.
“With these things in place, one has the basic foundations of the new iterative design process for novel proteins,” Ranganathan argues. “It is worth saying that this process lies at the junction of mathematics, computer science, physics, and traditional experimental biology, a combination of skills that is uncommon to say the least. So, a major aspect is training an advanced workforce to execute this new engineering technology.”
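The iterative process Ranganathan outlines (learn from sequence data, propose designs, assay them, retrain) can be sketched as a toy loop. Everything below is an assumption made for illustration: the model is a simple per-position frequency profile rather than a deep learning model, the input sequences are invented, and `mock_assay` is a stand-in for a real functional assay that scores similarity to an arbitrary hidden target.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def fit_profile(sequences, pseudocount=1.0):
    """Learn per-position amino acid frequencies from equal-length sequences."""
    profile = []
    for pos in range(len(sequences[0])):
        counts = {aa: pseudocount for aa in AMINO_ACIDS}
        for seq in sequences:
            counts[seq[pos]] += 1.0
        total = sum(counts.values())
        profile.append({aa: n / total for aa, n in counts.items()})
    return profile


def sample_sequences(profile, n, rng):
    """Draw n sequences, sampling each position independently from the profile."""
    designs = []
    for _ in range(n):
        residues = [
            rng.choices(list(col), weights=list(col.values()))[0] for col in profile
        ]
        designs.append("".join(residues))
    return designs


def mock_assay(seq, target="MKVLAAGE"):
    """Stand-in for a functional assay: fraction of positions matching a hidden target."""
    return sum(a == b for a, b in zip(seq, target)) / len(target)


rng = random.Random(0)  # fixed seed so the sketch is repeatable
training_set = ["MKVLAAGE", "MKILAAGD", "MRVLSAGE", "MKVLAVGE"]  # invented homologs
for round_number in range(3):
    profile = fit_profile(training_set)                  # learn the "rules"
    designs = sample_sequences(profile, n=200, rng=rng)  # propose a library
    ranked = sorted(designs, key=mock_assay, reverse=True)  # "assay" every design
    training_set = ranked[:20]                           # retrain on the best designs
    print(round_number, round(mock_assay(ranked[0]), 2))
```

In a real pipeline, the sampling step would correspond to DNA synthesis of the proposed designs and the scoring step to a high-throughput laboratory assay. The loop structure is the only part this sketch is meant to convey.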
The term AI covers a broad spectrum of automated decision-making techniques. They range from those that are based on conditional logic to those that use machine learning.
Deep learning is a related technique. It is also starting to be used in therapeutic protein engineering, thanks in part to the work of software developers such as NVIDIA.
“Deep learning is a subset of machine learning that specifically uses artificial neural networks to enable a computer to learn from data,” explains Kimberly Powell, vice president of healthcare at NVIDIA. “An artificial neural network is a particular arrangement of mathematical operations that were originally biologically inspired, loosely mimicking the connectivity and activation of neurons in the brain. The structure of neural networks, organized in multiple layers, allows them to address complex tasks.”
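Powell's description of layered mathematical operations can be made concrete with a very small network. The sketch below runs one forward pass through two layers: each layer computes weighted sums of its inputs, and an activation function decides how strongly each unit "fires." The weights and input values are hand-picked for illustration; a real network would learn its weights from data and would be far larger.

```python
import math


def dense(inputs, weights, biases):
    """One layer: each output is a weighted sum of all inputs plus a bias."""
    return [
        sum(w * x for w, x in zip(row, inputs)) + b
        for row, b in zip(weights, biases)
    ]


def relu(values):
    """Activation: pass positive values through, clamp negatives to zero."""
    return [max(0.0, v) for v in values]


def sigmoid(v):
    """Squash a single value into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-v))


# Hand-picked illustrative weights, not learned from any data.
hidden_weights = [[0.5, -0.2, 0.1], [-0.3, 0.8, 0.4]]  # 3 inputs -> 2 hidden units
hidden_biases = [0.0, 0.1]
output_weights = [[1.0, -1.0]]                         # 2 hidden units -> 1 output
output_biases = [0.05]

features = [1.0, 2.0, 3.0]  # arbitrary numeric features describing one input
hidden = relu(dense(features, hidden_weights, hidden_biases))
score = sigmoid(dense(hidden, output_weights, output_biases)[0])
print(hidden, round(score, 3))
```

Stacking more such layers, and adjusting the weights to reduce prediction error on training data, is what "deep" learning refers to.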
Deep learning models have many potential applications in protein engineering, from structure prediction to the assessment of solubility, location, and interactions with other molecules. “Some deep learning models for protein prediction are transformer-based large language models that read the text of amino acids,” Powell notes. “These models are large and train on unlabeled amino acid sequences, so there is no need for annotated data.
“The amount of amino acid sequences that we know is very high and growing; however, little is known about the properties of the proteins corresponding to these sequences. Fortunately, deep learning methods based on large language models can help scientists understand proteins and develop therapeutics more quickly. Because protein data is a sequence of letters that represent amino acids, deep learning approaches that have pioneered the natural language breakthroughs of the last five years can also be applied to protein sequence data.”
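To illustrate how a protein sequence can be treated as text without any annotation, the sketch below converts a sequence into integer tokens and hides a few of them. Predicting the hidden residues from the visible ones is one common self-supervised training objective for language models, and the sequence itself supplies the answers. The example sequence, the 15% masking fraction, and the function names are illustrative assumptions, not details of any specific model mentioned in this article.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues, one letter each
TOKEN_IDS = {aa: i for i, aa in enumerate(AMINO_ACIDS)}
MASK_ID = len(AMINO_ACIDS)  # extra token id reserved for the mask symbol


def tokenize(sequence):
    """Map each amino acid letter to an integer id."""
    return [TOKEN_IDS[aa] for aa in sequence]


def mask_tokens(tokens, mask_fraction, rng):
    """Hide a random subset of positions; return the masked input and the hidden answers."""
    n_masked = max(1, round(len(tokens) * mask_fraction))
    positions = sorted(rng.sample(range(len(tokens)), n_masked))
    masked = list(tokens)
    targets = {}
    for pos in positions:
        targets[pos] = tokens[pos]  # what the model would be asked to predict
        masked[pos] = MASK_ID       # what the model would actually see
    return masked, targets


rng = random.Random(0)  # fixed seed so the sketch is repeatable
tokens = tokenize("MKTAYIAKQRQISFVKSHFSRQ")  # arbitrary example sequence
masked, targets = mask_tokens(tokens, mask_fraction=0.15, rng=rng)
print(masked)
print(targets)
```

Because the training signal comes from the sequence alone, such a model can in principle learn from every known sequence, including the many whose properties have never been measured.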
According to Powell, key facilitators of the biopharmaceutical industry’s adoption of deep learning include reductions in sequencing costs; database resources (like UniProt, a database that contains more than 200 million protein sequences [2]); language models (such as ESM-1 and ProtT5, which are used to predict protein properties such as cellular location and secondary structure); and GPU-accelerated simulation toolkits (such as OpenMM [3]).
Another promising advance is the development of AlphaFold, an AI platform from DeepMind [4]. In 2020, AlphaFold won the CASP14 structure prediction competition, beating rival systems by a significant margin [5]. AlphaFold has also increased industry interest in computer-science-based protein engineering. Powell says, “OpenFold is a PyTorch-based reproduction of AlphaFold2 [the next iteration of the original DeepMind technology] that predicts the three-dimensional structures of proteins from their primary amino acid sequences.”
Now that extensive data resources and powerful technology platforms are available, AI and related techniques are bound to be used more frequently in protein design. “Today, companies are starting to bring drugs to market much faster due to deep learning,” Powell observes. She notes that Insilico Medicine has discovered a preclinical therapeutic candidate in under 18 months using an AI-based platform [6]. “Biopharmaceutical researchers,” she concludes, “are beginning to transition their workflows to in silico methods to understand proteins faster and bring therapeutics to market.”
Chris Bahl, PhD, president, chief scientific officer, and co-founder of AI Proteins, also expresses optimism about de novo protein design. He believes that it is ready to transition from being a decade-old academic endeavor to being a paradigm-shifting technology in drug development.
“Traditional approaches are limited to editing existing natural proteins,” he says, “but with de novo design, engineers can start building the proteins they want instead of modifying the proteins they have. In short, we have a high level of control. We can solve a lot of the problems that hold back current modalities. Ultimately, we can make medicines safer and more effective. We are no longer limited to tweaking a natural protein to do something it didn’t evolve to do.”
The drug industry’s desire to reduce product development time is another factor likely to increase the use of AI, machine learning, and associated techniques in protein drug development.
Bahl says that “AI is very complementary to high-throughput drug discovery,” and he points out that high-throughput drug discovery may involve the use of robotics, microfluidics, synthetic biology approaches, and next-generation sequencing to test thousands to millions of designed proteins for drug-like activity. “This generates massive datasets that can be used for machine learning,” he stresses. “So, the two tools are highly synergistic.”
AI Proteins uses computer science to engineer “miniproteins.” According to the company, miniproteins combine the “most important, drug-like features of small molecules and antibodies.”
“Miniproteins can solve many issues facing traditional antibody development, acting to drive down costs, speed up therapeutic development, and improve success rates,” Bahl elaborates. “Our high-throughput platform is capable of producing molecules ready for preclinical studies at unprecedented speed. Partnering with others will help us realize the full potential of this platform and use it to bring as much good to the world as possible.”
Future proofing with AI
Industry’s willingness to use computer-science-based techniques and technologies in protein design may be indicative of wider changes. “Biology is now making a transition from an analytical, tinkering enterprise to a formal engineering discipline,” says Evozyne’s Ranganathan. “It is capable of creating novel natural machines that rival and even exceed the performance of man-made devices.
“Evolution-based, data-driven molecular engineering processes are what people will use to solve complex problems in standard protein-based therapeutics. In addition, these processes will extend to controlling the emergence of new infectious diseases; to make future-proofed vaccines that are robust to the evolution of viruses, bacteria, and cancer cells; and to enable powerful site-specific gene editing.”
“The future lies in designing biology to produce natural machines to solve many real-world problems,” Ranganathan declares. “From the perspective of therapeutics, there is no doubt that the pharmaceutical industry must and will adopt the new data-driven methods as part of their discovery process.”
1. Anfinsen CB. Principles that govern the folding of protein chains. Science 1973; 181(4096): 223–230. DOI: 10.1126/science.181.4096.223.
2. Rives A, Meier J, Sercu T, et al. Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences. Proc. Natl. Acad. Sci. USA 2021; 118(15): e2016239118. DOI: 10.1073/pnas.2016239118.
3. Pandey M, Fernandez M, Gentile F, et al. The transformational role of GPU computing and deep learning in drug discovery. Nat. Mach. Intell. 2022; 4: 211–221. DOI: 10.1038/s42256-022-00463-x.
4. Jumper J, Evans R, Pritzel A, et al. Highly accurate protein structure prediction with AlphaFold. Nature 2021; 596(7873): 583–589. DOI: 10.1038/s41586-021-03819-2.
5. Callaway E. ‘It will change everything’: DeepMind’s AI makes gigantic leap in solving protein structures. Nature 2020; 588: 203–204. DOI: 10.1038/d41586-020-03348-4.
6. Insilico Medicine. From Start to Phase 1 in 30 Months: AI-Discovered and AI-Designed Antifibrotic Drug Enters Phase I Clinical Trial. Published February 24, 2022. Accessed January 11, 2023.