The protein folding puzzle
Proteins fold naturally into the shapes that make them biologically functional, but folding is a complex process that sometimes fails. For decades, scientists have searched for a reliable way to predict a protein’s structure from its sequence of amino acids so we can better understand how proteins work.
The challenge? There are over 200 million known distinct proteins. Each one has a unique 3D shape that determines how it works and what it does. Because there are so many sequences, and determining their 3D structure experimentally is so time-consuming and expensive, scientists only know the exact structure of a tiny fraction of these proteins. Even these experimental methods still fall short of consistently reliable accuracy.
DeepMind’s gigantic leap
In 2020, Alphabet’s artificial intelligence research arm, DeepMind, made a massive breakthrough in predicting protein structures using a deep learning model called AlphaFold.
AlphaFold is trained on publicly available data consisting of about 170,000 protein structures. It is the first computational method that can regularly predict the 3D shape of a protein at scale and with a high degree of accuracy.

Open source predictions available on Google Cloud
Together, Google Cloud and DeepMind have released a dataset of AlphaFold’s predicted protein structures for plants, bacteria, animals, and other organisms as part of the Google Cloud Public Dataset program, which enables bulk downloads at no cost. That means you can also run custom queries against the dataset using BigQuery!
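
To give a feel for what a custom query might look like, here is a minimal Python sketch. It assumes the predictions' metadata lives in a table named `bigquery-public-data.deepmind_alphafold.metadata` with columns `uniprotAccession`, `gene`, `organismScientificName`, and `globalMetricValue` (a per-structure confidence score). None of these names come from this post, so check the dataset's schema before relying on them. The helper names `build_query` and `run_query` are made up for illustration.

```python
# Assumed table name; verify it against the public dataset listing.
ALPHAFOLD_TABLE = "bigquery-public-data.deepmind_alphafold.metadata"


def build_query(limit: int = 10) -> str:
    """Return SQL listing the most confident predictions for one organism.

    Column names are assumptions about the dataset schema. The organism and
    confidence threshold are passed as BigQuery query parameters.
    """
    if limit < 1:
        raise ValueError("limit must be a positive integer")
    return (
        "SELECT uniprotAccession, gene, globalMetricValue\n"
        f"FROM `{ALPHAFOLD_TABLE}`\n"
        "WHERE organismScientificName = @organism\n"
        "  AND globalMetricValue >= @min_confidence\n"
        "ORDER BY globalMetricValue DESC\n"
        f"LIMIT {int(limit)}"
    )


def run_query(organism: str, min_confidence: float = 90.0, limit: int = 10):
    """Run the query with the BigQuery client library and return the rows."""
    # Imported here so build_query() works without the client library installed.
    from google.cloud import bigquery

    client = bigquery.Client()  # uses your default project and credentials
    job_config = bigquery.QueryJobConfig(
        query_parameters=[
            bigquery.ScalarQueryParameter("organism", "STRING", organism),
            bigquery.ScalarQueryParameter(
                "min_confidence", "FLOAT64", min_confidence
            ),
        ]
    )
    job = client.query(build_query(limit), job_config=job_config)
    return list(job.result())


if __name__ == "__main__":
    # Print the SQL only; call run_query("Homo sapiens") to execute it.
    print(build_query(limit=5))
```

Using query parameters rather than string formatting for the organism name keeps the SQL safe to reuse with arbitrary input; only the row limit, validated as an integer, is interpolated directly.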

Running AlphaFold on Google Cloud Vertex AI
Let’s say you want to run AlphaFold yourself to get protein structure predictions for your own data. There are a few steps to keep in mind:
- Set up feature engineering against genetic sequence databases
- Preprocess the data
- Run those inputs against the pre-trained models
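
As one small illustration of the preprocessing step, the sketch below parses protein sequences in FASTA format (the plain-text format AlphaFold accepts as input) and flags any characters outside the 20 standard amino acids before you spend compute on a database search. The function names and the example record are hypothetical, and this is only a sanity check under those assumptions, not AlphaFold's own preprocessing code.

```python
from typing import Dict, Iterable, Set

# The 20 standard amino acids in one-letter code.
STANDARD_AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Parse FASTA-formatted lines into a {record_id: sequence} mapping."""
    records: Dict[str, str] = {}
    current_id = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            tokens = line[1:].split()
            if not tokens:
                raise ValueError("FASTA header is missing an identifier")
            # Use the first whitespace-delimited token as the record ID.
            current_id = tokens[0]
            records[current_id] = ""
        elif current_id is None:
            raise ValueError("sequence data found before any '>' header")
        else:
            # Sequences may wrap across lines; join them into one string.
            records[current_id] += line.upper()
    return records


def invalid_residues(sequence: str) -> Set[str]:
    """Return the characters that are not standard amino acid codes."""
    return set(sequence) - STANDARD_AMINO_ACIDS


if __name__ == "__main__":
    # Hypothetical example record, not a real protein of interest.
    example = """>demo_peptide hypothetical example
MKTAYIAKQR
QISFVKSHFS
"""
    for record_id, sequence in parse_fasta(example.splitlines()).items():
        bad = invalid_residues(sequence)
        status = "ok" if not bad else f"unexpected characters: {sorted(bad)}"
        print(record_id, len(sequence), status)
```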

Try it out first using Vertex AI Workbench
For those of you who want to try out a simplified version of AlphaFold, we have a Colab notebook that uses no templates (homologous structures) and a selected portion of the BFD database. You can deploy it right on Vertex AI Workbench, which lets you specify a custom container image that we’ve already created for you. You’ll be able to:
- Configure access to genetic databases
- Configure GPU acceleration
- Search against genetic databases
- Use the pre-processed results as inputs to the AlphaFold model locally


Run hundreds of experiments reliably using Vertex AI Pipelines
Organizations that want to run a full-blown version of AlphaFold for many protein folding experiments a week will want an ML pipeline orchestrator. The AlphaFold Batch Inference solution is a set of code samples that supports hundreds of concurrent inference pipelines, giving you the higher throughput you need to run experiments at scale. The solution uses Vertex AI Pipelines as the orchestrator and runtime, Vertex ML Metadata to track metadata and artifacts, and Cloud Filestore to manage the genetic databases.



Now go forth and save the world!
Okay, maybe that’s a bit hyperbolic, but this is inspiring stuff! What started as a 50-year challenge led to the discovery of AlphaFold, and now to being able to run it on Google Cloud. Researchers, developers, and science enthusiasts now have access to one of the most pivotal advancements in the medical world. Even a non-specialist can easily use a Vertex AI notebook to try a simplified version of AlphaFold. The next answers to the mysteries of life and the discovery of disease treatments have never felt more attainable. With these no-cost solutions for running AlphaFold on Vertex AI and the Public Dataset, you can help propel this worldwide endeavor.
Learn more about healthcare and life sciences solutions on Google Cloud here.
If you have feedback or want to share your experience with me, reach out to me at @stephr_wong.
By: Stephanie Wong (Developer Advocate)
Source: Google Cloud Blog