Antonin Carette

A journey into a wild pointer

AlphaFold 2: a new AI achievement

Posted at — Dec 3, 2020

On the 30th of November, DeepMind published an article on their website about AlphaFold 2, an artificial intelligence that, with a median score of 92.4 GDT across all targets, solved a 50-year-old grand challenge: the protein folding problem.

DeepMind released the first version of AlphaFold in 2018.
This first version achieved a score of 58 GDT in the Global Distance Test at the Critical Assessment of Structure Prediction (CASP) competition, the highest score at the time.
For the occasion, an article was published in Nature, and the source code was made publicly available on GitHub. With version 2, DeepMind's dedicated team achieved a median score of 92.4 GDT across all targets of the same competition, and a median score of 87.0 GDT on the very hardest protein targets.
Almost 30 more GDT in two years: a huge improvement.

In this blog post, I will break this breakthrough down into several parts: the protein folding problem, the competition and the achievements before AlphaFold, and, very briefly, both AlphaFold 1 and 2.

Protein? Folding? What?

A protein is a macromolecule (= big molecule) that performs a vast array of functions within organisms, from transporting molecules to replicating DNA[1].
In short: proteins are essential to life.
Like any other chain of molecules - a protein is composed of amino acids - two proteins can differ by their content, or composition.
However, something is specific and unique to a protein: its 3D structure!

Indeed, two proteins with the exact same amino acids can have two different functions if their 3D structures (their physical shapes in space) differ, as a shape can expose, or hide, an activation site that surrounding molecules interact with specifically.
Even though the community still debates whether the sequence similarity between two proteins can be reduced to a structural similarity, we will assume here that the 3D structure of a protein determines its function within an organism.

The structure of a protein is, almost always, very complex to understand, as it depends on four kinds of interactions: hydrophobic interactions, hydrogen bonds, Van der Waals forces between nearby amino acids, and disulphide bonds between cysteine residues.

Testing all the possible folds of a protein by brute force is therefore not feasible: according to Levinthal's paradox, there could be more than 10^143 ways to fold a single protein, which is far more than in chess (10^120), though fewer than the possible moves in a game of Go (10^360).
Obviously, misfolding a protein has far more serious implications than making a wrong move in a Go game…
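To get an intuition for where a number like 10^143 comes from, here is a back-of-the-envelope Levinthal-style estimate. The protein length and the conformation count per residue are invented round numbers for illustration, not the exact figures behind the paradox:

```python
import math

# Assume a medium-sized protein of 300 residues, and that each residue's
# backbone can adopt roughly 3 distinct conformations. The total number
# of folds is then combinatorial in the length of the chain.
n_residues = 300
conformations_per_residue = 3
total = conformations_per_residue ** n_residues

print(f"~10^{int(math.log10(total))} possible folds")  # ~10^143
```

Even sampling one fold per femtosecond, enumerating that many conformations would take astronomically longer than the age of the universe, which is exactly why brute force is off the table.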

Ribbon diagram of the 3D structure of hemoglobin - credits to Azom.com[2]

Unfortunately, a protein can be modified, or misfolded, and lead to serious issues in the organism…
Indeed, an accumulation of misfolded proteins - proteins that contain at least one structural modification - can lead to one or more diseases, sometimes neurodegenerative ones like Alzheimer's disease or Huntington's disease.

So, being able to predict the structure of a protein from its amino-acid sequence can help scientists design a specific protein for a drug, or understand how a modification, or mutation, in a protein can cause a specific disease.

Competitions, research and results

Every two years, a community-wide worldwide experiment for protein structure prediction, called CASP, runs a global competition to review the recent scientific research and results on protein folding. The review is done by computing the GDT, or Global Distance Test, on a blind set of protein targets whose structures have been solved experimentally but not yet published.
GDT is a measure of similarity between two protein structures with identical amino-acid sequences but different shapes - in short, a prediction accuracy for protein shapes.
The higher the total GDT score, the closer a given model is to the reference structure.
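To make the measure concrete, here is a simplified sketch of the GDT_TS variant: the average, over the distance cutoffs 1, 2, 4 and 8 ångströms, of the percentage of residues whose predicted position lies within that cutoff of the reference position. This is my own toy version, assuming the two structures are already superposed (the real computation searches over superpositions):

```python
import numpy as np

def gdt_ts(predicted: np.ndarray, reference: np.ndarray) -> float:
    """Simplified GDT_TS for two (n_residues, 3) coordinate arrays."""
    # Per-residue distance between predicted and reference positions.
    distances = np.linalg.norm(predicted - reference, axis=-1)
    cutoffs = (1.0, 2.0, 4.0, 8.0)
    # Fraction of residues within each cutoff, averaged over cutoffs.
    return 100.0 * np.mean([(distances <= c).mean() for c in cutoffs])

# Toy usage: a perfect prediction scores 100, a wildly wrong one scores 0.
ref = np.random.default_rng(1).normal(size=(100, 3))
print(gdt_ts(ref, ref))         # 100.0
print(gdt_ts(ref + 10.0, ref))  # 0.0
```

A score of 90 GDT is considered roughly competitive with experimental methods, which is why AlphaFold 2's 92.4 made such a splash.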

The CASP competitions for protein folding have been held since 1994, and new algorithms and results are evaluated by the community every two years.

CASP results, 30th November of 2020 - credits to DeepMind[3]

The AlphaFold breakthrough

CASP 13, in 2018, was won by the first version of AlphaFold, with 58 GDT, the highest score at the time.
This year, CASP 14 was won by the second version of AlphaFold, with nearly 90 GDT.

The secret of the second version of AlphaFold is not known yet but, as stated by Mirko Torrisi, Gianluca Pollastri and Quan Le, the first version is known to be a distance map predictor, implemented as a very deep convolutional neural network, trained on more than 170,000 proteins using around 200 GPUs.[4][5][6]

AlphaFold 1 works in two steps:

  1. a Convolutional Neural Network (CNN) that takes the amino-acid sequence (and features derived from evolutionarily related sequences) as input, and predicts a map of the distances between every pair of amino acids in the protein;
  2. a gradient descent optimisation that folds a 3D structure until it matches the predicted distances between the amino acids in the protein.

AlphaFold 1 architecture - credits to DeepMind[6]
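The second step can be illustrated with a toy experiment (my own simplification, not DeepMind's code): take a predicted distance map and move a set of 3D points by plain gradient descent until their pairwise distances match it. Here the "predicted" map is simply computed from a known random structure, and the optimisation starts from a slightly perturbed copy so the demo converges quickly:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10  # a toy "protein" of 10 points

# Pretend this distance map came from the neural network: in this demo
# it is computed from a known random structure.
target = rng.normal(size=(n, 3))
d_pred = np.linalg.norm(target[:, None] - target[None, :], axis=-1)

# Start from a perturbed structure and descend on the squared error
# between current and predicted pairwise distances.
coords = target + 0.1 * rng.normal(size=(n, 3))
lr = 0.01
for _ in range(2000):
    diff = coords[:, None] - coords[None, :]        # (n, n, 3) displacements
    dist = np.linalg.norm(diff, axis=-1)
    np.fill_diagonal(dist, 1.0)                     # avoid division by zero
    err = dist - d_pred
    np.fill_diagonal(err, 0.0)                      # ignore self-distances
    grad = 4.0 * (err / dist)[:, :, None] * diff    # d(loss)/d(coords)
    coords -= lr * grad.sum(axis=1)

final = np.linalg.norm(coords[:, None] - coords[None, :], axis=-1)
print(np.abs(final - d_pred).max())                 # close to 0
```

The real system optimises a much richer potential (including torsion angles and physical terms), but the principle is the same: the network predicts geometry, and a differentiable optimiser realises it in 3D.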

As stated by the AlphaFold team, the second version of the model may be significantly different from the first one, as the first version had a (light) tendency to overfit the data.

Many people speculate that the second version replaced the CNNs with transformers, an attention-based neural network architecture, and that it relies on more accurate feature engineering of the amino-acid residue sequences.
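For readers unfamiliar with transformers, here is a minimal sketch of their core building block, scaled dot-product self-attention, applied to a toy sequence of residue embeddings. The shapes and sizes are invented for illustration; this is in no way AlphaFold 2's actual architecture:

```python
import numpy as np

def self_attention(x, wq, wk, wv):
    """One self-attention layer: every residue attends to every other."""
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = q @ k.T / np.sqrt(k.shape[-1])          # pairwise affinities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over residues
    return weights @ v                               # weighted mix of values

rng = np.random.default_rng(0)
seq_len, d = 8, 16                   # 8 residues, 16-dim embeddings (toy sizes)
x = rng.normal(size=(seq_len, d))
wq, wk, wv = (rng.normal(size=(d, d)) for _ in range(3))
out = self_attention(x, wq, wk, wv)
print(out.shape)                     # (8, 16): one updated vector per residue
```

The appeal for protein folding is clear: unlike a CNN's local receptive field, attention lets residues that are far apart in the sequence, but close in the folded structure, influence each other directly.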

What’s next ?

“In some sense the problem is solved.” - John Moult

AlphaFold will have implications for areas like drug design or treatments, biomaterials, environmental sustainability - like designing enzymes to eat the plastic in our oceans - or agriculture.
Indeed, AlphaFold could be used to understand how a protein's structure can lead to specific diseases, to identify proteins that have malfunctioned, and even to learn the unknown functions of genes encoded in DNA.

Both AlphaFold versions are the work of incredibly talented engineers and researchers at DeepMind, but also of all the researchers who have contributed over the last five decades at least.
So, congratulations to John Jumper, Richard Evans, Alexander Pritzel, Tim Green, Michael Figurnov, Kathryn Tunyasuvunakool, Olaf Ronneberger, Russ Bates, Augustin Žídek, Alex Bridgland, Clemens Meyer, Simon A. A. Kohl, Anna Potapenko, Andrew J. Ballard, Andrew Cowie, Bernardino Romera-Paredes, Stanislav Nikolov, Rishub Jain, Jonas Adler, Trevor Back, Stig Petersen, David Reiman, Martin Steinegger, Michalina Pacholska, David Silver, Oriol Vinyals, Andrew W Senior, Koray Kavukcuoglu, Pushmeet Kohli, Demis Hassabis, and all the other engineers and researchers who have been involved in any way in this subject.

As stated in their blog post, DeepMind is pretty optimistic about the impact AlphaFold can have on the wider world… I am too!
To me, AlphaFold is another proof that AI can be beneficial for humanity.

Five years after AlphaGo, DeepMind just solved another big challenge… Who knows which big challenge DeepMind will solve in 2025!

Online resources