AI for Drug Discovery: How AlphaFold Reinvented Biology
Introduction
Drug development is one of humanity's most expensive and failure-prone endeavours. A new medicine typically takes 10 to 15 years to reach patients, costs over $1 billion in research and development, and has a greater than 90 percent chance of failing in clinical trials despite looking promising in the laboratory.
That bottleneck has a root cause: biology is extraordinarily complex, and until recently our ability to model it computationally was primitive. Researchers would spend years just figuring out what a single protein looked like in three dimensions before they could even begin designing a drug to interact with it.
Then, in 2020, DeepMind's AlphaFold 2 solved what biologists had called the protein folding problem, a challenge that had resisted 50 years of scientific effort. The solution was a deep learning model. In 2024, Demis Hassabis and John Jumper shared the Nobel Prize in Chemistry for this work, alongside David Baker of the University of Washington for his independent contributions to computational protein design.
This article explains why drug discovery is fundamentally a machine learning problem, how AlphaFold works, what generative chemistry is doing to design molecules from scratch, and which AI-discovered drugs are already in human clinical trials.
Why Drug Discovery Is a Machine Learning Problem
At its core, drug discovery is a search problem, and the search space is incomprehensibly large.
- Chemical space is the set of all possible drug-like molecules, commonly estimated at around 10^60 candidates. The entire history of pharmaceutical research has sampled only a tiny fraction of this space through physical experimentation. No brute-force physical search is possible; intelligent navigation requires prediction models.
- Protein structure determines drug function. A drug works by binding to a target protein and altering its activity. To design a drug that fits a protein precisely, like a key in a lock, you must know the protein's three-dimensional shape. That shape is determined by the protein's amino acid sequence, but predicting three-dimensional shape from a one-dimensional sequence involves exploring a combinatorial space that makes chess look simple.
- Biological data is enormous and noisy. Modern drug discovery generates petabytes of data across genomics, proteomics, clinical trial results, and electronic health records. Identifying which patterns predict drug efficacy or toxicity requires machine learning to extract signals from noise at a scale no human analysis team can match.
The Drug Discovery Pipeline
Understanding where AI helps requires understanding the pipeline it is disrupting. Each stage adds years to the timeline and offers a different surface area for computational acceleration.
| Stage | What Happens | Typical Duration | Where AI Is Active |
|---|---|---|---|
| Target Identification | Identify the protein or pathway responsible for a disease | 1 to 2 years | Genomics analysis, knowledge graphs, literature mining |
| Hit Finding | Screen millions of compounds to find candidates that interact with the target | 1 to 2 years | Virtual screening, molecular docking, GNN property prediction |
| Lead Optimisation | Refine the best hits into drug candidates with good ADMET properties | 2 to 3 years | Generative chemistry, reinforcement learning, multi-objective optimisation |
| Preclinical Testing | Test in cell cultures and animal models for safety and efficacy | 1 to 2 years | In-silico toxicity prediction, synthetic route planning |
| Clinical Trials (Phase 1 to 3) | Test in humans for safety, dosing, and efficacy | 6 to 10 years | Patient stratification, biomarker discovery, adaptive trial design |
AI is currently most mature in the early discovery stages, from target identification through lead optimisation, where the problem is computational rather than biological and the feedback loop between prediction and experimental validation is measured in days rather than months.
AlphaFold: Solving the 50-Year Protein Folding Problem
What Is Protein Folding?
Every protein is a chain of amino acids, a sequence that looks like a string of 20 possible letters. This chain spontaneously folds into a specific three-dimensional shape in milliseconds, driven by the thermodynamic interactions between amino acids and the surrounding water molecules. That shape determines everything the protein does: which molecules it binds, which reactions it catalyses, which cellular signals it transmits.
The problem is that predicting the 3D shape from the 1D sequence is combinatorially explosive. A protein of 100 amino acids can theoretically adopt an astronomical number of conformations, and the correct one is not obvious from the sequence alone. Physically determining the structure requires expensive laboratory techniques like X-ray crystallography and cryo-electron microscopy that take months to years per protein. As of 2020, only about 170,000 protein structures had ever been experimentally solved, out of hundreds of millions of known protein sequences.
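The scale of that explosion can be made concrete with a Levinthal-style back-of-envelope calculation. The sketch below assumes, purely for illustration, three possible conformations per residue and a sampling rate of 10^13 conformations per second; neither figure comes from the text above.

```python
# Levinthal-style estimate of the conformational search space.
# Both constants are illustrative assumptions, not measured values.
STATES_PER_RESIDUE = 3        # assumed conformations available to each residue
SAMPLES_PER_SECOND = 1e13     # assumed rate at which conformations could be tried
SECONDS_PER_YEAR = 365 * 24 * 3600


def conformation_count(n_residues: int, states: int = STATES_PER_RESIDUE) -> int:
    """Number of chain conformations if every residue varies independently."""
    return states ** n_residues


def exhaustive_search_years(n_residues: int) -> float:
    """Years needed to try every conformation once at the assumed rate."""
    return conformation_count(n_residues) / SAMPLES_PER_SECOND / SECONDS_PER_YEAR


count = conformation_count(100)
years = exhaustive_search_years(100)
print(f"Conformations for 100 residues: {count:.2e}")
print(f"Exhaustive search time: {years:.2e} years")
```

Under these assumptions an exhaustive search would take on the order of 10^27 years, so real proteins cannot be folding by random search, and a predictor cannot work by enumeration either.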
AlphaFold 2's Approach
AlphaFold 2 uses a deep learning architecture that combines two powerful ideas to crack this problem.
The first idea is multiple sequence alignment. The model analyses how the target protein's sequence has evolved across hundreds of species. Amino acid positions that co-evolve across species, meaning that when one changes the other tends to change as well, tend to be physically close to each other in 3D space because they interact. This evolutionary signal encodes geometric constraints without any explicit physics simulation.
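A classical way to quantify co-evolution is the mutual information between two alignment columns. The sketch below applies it to a made-up four-sequence alignment. AlphaFold 2 does not compute this statistic explicitly (it learns from the raw alignment), so treat this as an illustration of the signal rather than of the model; the function names and the toy data are hypothetical.

```python
from collections import Counter
from math import log2


def column(msa, i):
    """Residues found at alignment position i, one per sequence."""
    return [seq[i] for seq in msa]


def mutual_information(msa, i, j):
    """Mutual information in bits between alignment columns i and j."""
    n = len(msa)
    counts_i = Counter(column(msa, i))
    counts_j = Counter(column(msa, j))
    counts_ij = Counter(zip(column(msa, i), column(msa, j)))
    mi = 0.0
    for (a, b), count in counts_ij.items():
        p_ab = count / n
        p_a = counts_i[a] / n
        p_b = counts_j[b] / n
        mi += p_ab * log2(p_ab / (p_a * p_b))
    return mi


# Toy alignment (made up): positions 1 and 3 always change together,
# while position 2 varies independently of position 1.
toy_msa = [
    "AKLEG",
    "AKVEG",
    "ARLDG",
    "ARVDG",
]

coupled = mutual_information(toy_msa, 1, 3)
uncoupled = mutual_information(toy_msa, 1, 2)
print(f"MI(1, 3) = {coupled:.2f} bits, MI(1, 2) = {uncoupled:.2f} bits")
```

The coupled pair scores one full bit and the independent pair scores zero, which is the kind of contrast that hints at spatial contact between two positions.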
The second idea is equivariant attention. AlphaFold 2's structure module uses a specialised form of the transformer attention mechanism, called invariant point attention, whose results do not depend on how the protein is oriented or positioned in space. The predicted structure is therefore equivariant to rotations and translations: rotate or translate the input frame and the output moves with it. This allows the model to predict 3D atomic coordinates directly rather than first predicting a 2D map of inter-residue distances and reconstructing a structure from it.
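The difference between equivariance and invariance is easy to check numerically. The sketch below uses two toy functions, not AlphaFold's attention: a centroid, which moves with the input, and a pairwise distance matrix, which does not change at all. The coordinates are random stand-ins for atom positions.

```python
import numpy as np


def random_rotation(rng):
    """Random proper 3D rotation matrix built from a QR decomposition."""
    q, r = np.linalg.qr(rng.normal(size=(3, 3)))
    q = q * np.sign(np.diag(r))      # fix column signs
    if np.linalg.det(q) < 0:         # force det = +1 (rotation, not reflection)
        q[:, 0] = -q[:, 0]
    return q


def centroid(coords):
    """Equivariant toy function: the mean position moves with the input."""
    return coords.mean(axis=0)


def pairwise_distances(coords):
    """Invariant toy function: distances ignore rigid motion."""
    diff = coords[:, None, :] - coords[None, :, :]
    return np.linalg.norm(diff, axis=-1)


rng = np.random.default_rng(0)
coords = rng.normal(size=(5, 3))     # five made-up atom positions
R = random_rotation(rng)
t = rng.normal(size=3)
moved = coords @ R.T + t             # rotate, then translate every atom

# Equivariance: f(Rx + t) equals R f(x) + t
equivariant_ok = np.allclose(centroid(moved), R @ centroid(coords) + t)
# Invariance: f(Rx + t) equals f(x)
invariant_ok = np.allclose(pairwise_distances(moved), pairwise_distances(coords))
print(equivariant_ok, invariant_ok)
```

A network that outputs coordinates should behave like the first function, while features computed inside the network can safely behave like the second.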
At the CASP14 competition in 2020, AlphaFold 2 achieved a median backbone accuracy of 0.96 angstroms (root-mean-square deviation), comparable to the precision of expensive laboratory methods. The next-best method scored 2.8 angstroms. Many in the biological community described the result as a decades-old problem solved in a single paper.
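Backbone accuracy figures of this kind are root-mean-square deviations between predicted and experimental atom positions after the two structures have been optimally superimposed. The following is a minimal sketch of that calculation using the Kabsch algorithm. The coordinates are made up, and the official CASP assessment applies further conventions that this sketch ignores.

```python
import numpy as np


def kabsch_rmsd(P, Q):
    """RMSD between N x 3 point sets P and Q after optimal rigid superposition."""
    P_c = P - P.mean(axis=0)                 # remove translation
    Q_c = Q - Q.mean(axis=0)
    H = P_c.T @ Q_c                          # 3 x 3 covariance matrix
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))   # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T  # rotation mapping P_c onto Q_c
    P_rot = P_c @ R.T
    return float(np.sqrt(np.mean(np.sum((P_rot - Q_c) ** 2, axis=1))))


# Made-up backbone fragment and a rigidly moved copy of it
backbone = np.array([
    [0.0, 0.0, 0.0],
    [1.5, 0.0, 0.0],
    [2.0, 1.4, 0.0],
    [3.5, 1.6, 0.3],
])
rot_z = np.array([[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]])
moved = backbone @ rot_z.T + np.array([5.0, -2.0, 1.0])

# A perturbed copy stands in for an imperfect prediction
rng = np.random.default_rng(1)
predicted = moved + rng.normal(scale=0.5, size=moved.shape)

rmsd_rigid = kabsch_rmsd(backbone, moved)
rmsd_noisy = kabsch_rmsd(backbone, predicted)
print(f"Rigid copy: {rmsd_rigid:.3f}, noisy copy: {rmsd_noisy:.3f}")
```

The rigidly moved copy superimposes essentially perfectly, while the perturbed copy leaves a residual error; a sub-angstrom value means predicted atoms sit within about one atomic radius of their measured positions.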
AlphaFold Database and AlphaFold 3
DeepMind then released AlphaFold 2's structure predictions for over 200 million proteins, essentially every known protein sequence, in a freely accessible database. This expanded the available structural knowledge by more than a thousandfold overnight and immediately changed how researchers approached every stage of drug discovery.
AlphaFold 3, released in 2024, extended the model to predict protein-ligand interactions, meaning how a drug molecule sits inside a target protein's binding pocket. This makes it directly usable for drug design, not just structural biology, and reduces reliance on slow experimental determination of protein-ligand structures.
Generative Chemistry: Designing Molecules From Scratch
Predicting whether a molecule works is one problem. Designing new molecules that will work is a considerably harder one. This is the domain of generative chemistry, and it is where the most commercially exciting AI drug discovery work is happening today.
The challenge is multi-objective optimisation. A useful drug candidate must simultaneously satisfy a long list of conflicting constraints. It needs high binding affinity to its target protein, selectivity (meaning it should not bind to off-target proteins that cause side effects), good ADMET properties (absorption, distribution, metabolism, excretion, and toxicity), and synthesisability, meaning a chemist must actually be able to manufacture it in a laboratory. Improving one property often degrades another, so the goal is to find candidates that sit on the Pareto frontier across all dimensions.
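The Pareto frontier is the set of candidates that no other candidate beats on every objective at once. A minimal sketch with invented scores follows; the molecule names and numbers are hypothetical, and higher is treated as better on each axis.

```python
def dominates(a, b):
    """True if a is at least as good as b on every objective and better on one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_front(candidates):
    """Keep only the candidates that no other candidate dominates."""
    front = {}
    for name, scores in candidates.items():
        beaten = any(
            dominates(other, scores)
            for other_name, other in candidates.items()
            if other_name != name
        )
        if not beaten:
            front[name] = scores
    return front


# Hypothetical scores: (binding affinity, selectivity, synthesisability)
candidates = {
    "mol_A": (0.9, 0.2, 0.5),
    "mol_B": (0.6, 0.8, 0.6),
    "mol_C": (0.5, 0.7, 0.4),   # worse than mol_B on all three objectives
    "mol_D": (0.3, 0.9, 0.9),
}

front = pareto_front(candidates)
print(sorted(front))
```

Here mol_C drops out because mol_B beats it everywhere, while the other three each represent a different trade-off that a chemist would have to choose between.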
Reinforcement Learning for Molecular Optimisation
One of the most practically successful generative chemistry approaches is reinforcement learning applied to molecular construction. The idea is to frame drug design as a sequential decision problem: an agent starts with a partial molecule and at each step chooses which atom or functional group to add, modifying the structure incrementally. A reward signal, computed by property prediction models, scores each completed molecule on its predicted binding affinity, drug-likeness, and synthesisability.
AstraZeneca's REINVENT system, one of the most widely used tools in computational drug design, works on this principle. It trains a recurrent neural network to generate SMILES strings (a text representation of molecular structure) and uses reinforcement learning to steer the generator toward regions of chemical space that score well on the reward function. Chemists can tune the reward by adjusting the weights they give each property, effectively having a conversation with the model about what trade-offs matter most for their particular project.
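The weighting step can be sketched as a function that collapses several property scores into one reward. The version below uses a weighted geometric mean, which is one common choice. It is a hypothetical illustration, not REINVENT's actual API, and the property names and scores are invented.

```python
import math


def aggregate_reward(scores, weights):
    """Weighted geometric mean of per-property scores, each in [0, 1]."""
    total_weight = sum(weights[name] for name in scores)
    log_sum = 0.0
    for name, value in scores.items():
        if value <= 0.0:
            return 0.0               # a zero on any property vetoes the molecule
        log_sum += weights[name] * math.log(value)
    return math.exp(log_sum / total_weight)


# Invented predictions for one generated molecule
scores = {"affinity": 0.8, "drug_likeness": 0.5, "synthesisability": 0.9}

equal_weights = {"affinity": 1.0, "drug_likeness": 1.0, "synthesisability": 1.0}
affinity_first = {"affinity": 3.0, "drug_likeness": 1.0, "synthesisability": 1.0}

reward_equal = aggregate_reward(scores, equal_weights)
reward_affinity = aggregate_reward(scores, affinity_first)
print(f"Equal weights: {reward_equal:.3f}, affinity-weighted: {reward_affinity:.3f}")
```

Raising the weight on affinity lifts the reward for this molecule because affinity is one of its stronger properties; the generator is then nudged toward molecules with the same profile.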
Diffusion Models for Molecular Design
The same diffusion model framework that generates photorealistic images has been adapted to molecular generation. RFdiffusion, developed at the University of Washington, applies diffusion to protein backbone design. It generates novel protein structures with desired geometric properties, such as a binding pocket of a specific shape and size, which can then be expressed biologically. RFdiffusion has been used to design proteins that do not exist in nature and that bind to therapeutic targets with high affinity. In 2023, the same team used it to design binders for flu haemagglutinin, a potential vaccine target.
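The core mechanic shared by diffusion models is a forward process that gradually replaces data with Gaussian noise, paired with a learned model that reverses it. Below is a minimal sketch of the forward step on toy 3D coordinates, using a generic linear noise schedule. RFdiffusion's actual formulation for protein backbones differs, and the schedule values and coordinates here are assumptions for illustration.

```python
import numpy as np


def make_alpha_bar(n_steps, beta_start=1e-4, beta_end=0.02):
    """Cumulative signal fraction retained after each noising step."""
    betas = np.linspace(beta_start, beta_end, n_steps)
    return np.cumprod(1.0 - betas)


def add_noise(x0, t, alpha_bar, rng):
    """Sample the noised coordinates x_t directly from the clean coordinates x0."""
    eps = rng.normal(size=x0.shape)
    x_t = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    return x_t, eps


def recover_x0(x_t, eps, t, alpha_bar):
    """Invert the forward step given the noise; a trained model predicts eps."""
    return (x_t - np.sqrt(1.0 - alpha_bar[t]) * eps) / np.sqrt(alpha_bar[t])


rng = np.random.default_rng(0)
alpha_bar = make_alpha_bar(1000)
x0 = rng.normal(size=(10, 3))        # ten made-up backbone atom positions

x_t, eps = add_noise(x0, 500, alpha_bar, rng)
x0_back = recover_x0(x_t, eps, 500, alpha_bar)
print(f"Signal kept at step 0: {alpha_bar[0]:.4f}, at step 999: {alpha_bar[-1]:.1e}")
```

Generation runs this in reverse: start from pure noise and repeatedly subtract the noise the network predicts, optionally conditioning each step on constraints such as a target binding site.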
Graph Neural Networks for Property Prediction
Molecules are graphs: atoms are nodes and chemical bonds are edges. Graph Neural Networks are architectures designed to learn on graph-structured data, which makes them a natural fit for molecular property prediction.
A GNN for molecules uses a process called message passing. At each layer, every atom aggregates information from its bonded neighbours: it collects their feature vectors, applies a learned transformation, and updates its own representation. After several rounds of message passing, each atom's embedding reflects not just its own element type but the chemical environment several bonds away. A final readout function aggregates all atom embeddings into a molecule-level vector, which is then passed to a prediction head.
This architecture is trained to predict properties that are expensive to measure experimentally, including binding affinity to a specific protein, aqueous solubility (which affects how well the drug is absorbed), metabolic stability (how quickly the liver breaks it down), and hERG channel inhibition (a measure of cardiac toxicity risk). Accurate property predictors act as fast surrogate models that can screen millions of candidate molecules in seconds rather than the days it would take to measure them in the lab.
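The message-passing and readout steps described above can be sketched in a few lines of NumPy. This is a bare-bones illustration, not a production GNN: the identity weight matrix stands in for learned parameters, and the molecule is the three heavy atoms of ethanol (C-C-O) with one-hot element features.

```python
import numpy as np


def message_passing_layer(node_feats, adjacency, weight):
    """One round: sum neighbour features, add own state, transform, apply ReLU."""
    messages = adjacency @ node_feats
    combined = node_feats + messages
    return np.maximum(combined @ weight, 0.0)


def readout(node_feats):
    """Sum atom embeddings into one molecule-level vector."""
    return node_feats.sum(axis=0)


# Ethanol heavy atoms: C0 - C1 - O2; features are one-hot [is_carbon, is_oxygen]
atom_features = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
adjacency = np.array([
    [0.0, 1.0, 0.0],
    [1.0, 0.0, 1.0],
    [0.0, 1.0, 0.0],
])
weight = np.eye(2)    # stand-in for a learned weight matrix

h1 = message_passing_layer(atom_features, adjacency, weight)
h2 = message_passing_layer(h1, adjacency, weight)
molecule_vector = readout(h2)

# C0 is two bonds from the oxygen, so it only "sees" it after two rounds
print("C0 oxygen channel after 1 round:", h1[0, 1])
print("C0 oxygen channel after 2 rounds:", h2[0, 1])
print("Molecule vector:", molecule_vector)
```

Because the readout is a sum, the molecule vector does not depend on the order in which atoms are listed, which is a property any molecular representation needs.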
Real AI-Discovered Drugs in Clinical Trials
AI-assisted drug discovery is no longer theoretical. Multiple compounds that were identified, designed, or substantially optimised by machine learning are currently in human clinical trials.
| Drug / Company | Disease | AI Role | Trial Status (2026) |
|---|---|---|---|
| ISM001-055 / Insilico Medicine | Idiopathic pulmonary fibrosis (IPF) | Target identified by generative AI; molecule designed and optimised by AI end-to-end | Phase 2 (described by Insilico as the first drug with both an AI-discovered target and an AI-designed molecule to reach Phase 2) |
| DSP-1181 / Exscientia & Sumitomo Dainippon | Obsessive-compulsive disorder (OCD) | AI-optimised candidate reached Phase 1 in 12 months vs. the typical 4 to 5 years | Phase 1 (completed 2020; one of the earliest AI-designed molecules in human trials) |
| REC-994 / Recursion Pharmaceuticals | Cerebral cavernous malformation | Phenotypic screening of millions of compounds using AI image analysis to identify hits | Phase 2 |
| BNT111 / BioNTech | Melanoma (shared tumour-antigen mRNA vaccine) | mRNA encodes four shared tumour-associated antigens; AI assists in sequence optimisation and antigen selection (FixVac platform) | Phase 2 |
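| Halicin / MIT (academic) | Drug-resistant bacterial infections | A graph neural network flagged halicin, which is structurally unlike conventional antibiotics, in a drug-repurposing library; the same model then screened more than 107 million molecules for further candidates | Preclinical; published in Cell (2020) |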
Code Example: Molecular Fingerprints with RDKit
The following example shows how to convert a molecule (represented as a SMILES string) into a Morgan fingerprint and use it to predict a property with scikit-learn. This is the foundation of what industrial virtual screening pipelines do at scale, though they use larger datasets and graph neural networks rather than random forests.
from rdkit import Chem
from rdkit.Chem import AllChem
import numpy as np
from sklearn.ensemble import RandomForestClassifier


def smiles_to_fingerprint(smiles: str, radius: int = 2, nbits: int = 2048):
    """Convert a SMILES string to a Morgan fingerprint (0/1 array), or None if it cannot be parsed."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return None
    # Newer RDKit releases prefer rdFingerprintGenerator.GetMorganGenerator for this step.
    fp = AllChem.GetMorganFingerprintAsBitVect(mol, radius=radius, nBits=nbits)
    return np.array(fp)


# Predict blood-brain barrier (BBB) permeability from structure
molecules = [
    "CC(=O)Oc1ccccc1C(=O)O",             # aspirin
    "c1ccc2c(c1)cc1ccc3cccc4ccc2c1c34",  # benzo[a]pyrene
    "CN1C=NC2=C1C(=O)N(C(=O)N2C)C",      # caffeine
]
labels = [1, 0, 1]  # 1 = BBB-permeable, 0 = not permeable (illustrative)

# Featurise the training set, skipping any SMILES that RDKit cannot parse
fingerprints = [smiles_to_fingerprint(s) for s in molecules]
X = np.array([fp for fp in fingerprints if fp is not None])
y = np.array([label for fp, label in zip(fingerprints, labels) if fp is not None])

clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X, y)

# Predict for a new molecule
new_mol = "CC12CCC3C(C1CCC2O)CCC4=CC(=O)CCC34C"  # testosterone
fp = smiles_to_fingerprint(new_mol)
if fp is None:
    raise ValueError(f"Could not parse SMILES: {new_mol}")
pred = clf.predict(fp.reshape(1, -1))[0]
print(f"Predicted BBB permeability: {'Permeable' if pred == 1 else 'Not permeable'}")
In production systems, this fingerprint approach is extended with graph neural networks that learn task-specific molecular representations rather than using fixed fingerprints, and trained on databases like ChEMBL (roughly 2.4 million compounds with around 20 million recorded bioactivity measurements) or PubChem (over 100 million compounds). The key advantage of GNNs over fingerprints is that they learn which structural features matter for each specific property, rather than relying on a fixed, hand-specified set of substructure features.
Limitations and Challenges
AI is not a cure for the fundamental difficulty of drug development. Several deep challenges remain that computational methods have not yet meaningfully addressed.
In-silico to in-vivo translation is the central unsolved problem. A molecule that looks perfect in a computational model can fail completely when it encounters a living system. Biology is messy in ways that models do not capture: cell membranes are selective barriers, metabolic enzymes degrade some compounds faster than others, compensatory cellular pathways respond to perturbations, and individual genetic variation means the same drug can work differently in different people. No computational model fully represents this complexity.
Predicting clinical success remains beyond current AI capability. The 90 percent failure rate in clinical trials is dominated by failures in Phase 2 and 3, where drugs that looked effective in cell cultures and animal models simply do not work in humans. AI has significantly accelerated pre-clinical discovery, but it cannot yet predict which drug candidates will pass human trials. The variables that matter most in Phase 3, including patient heterogeneity, off-target effects at therapeutic doses, and long-term safety, are not well represented in any training dataset.
Data quality and IP barriers constrain what models can learn. The highest-value bioactivity data, from pharmaceutical companies' internal high-throughput screening libraries, is proprietary and not shared. Models trained on public databases like ChEMBL are therefore trained on a biased subset of known bioactivity space, and may not generalise well to the novel chemical territories that generative models are trying to explore.
What's Next
Protein language models are the most active frontier in computational biology. Large transformer models trained on protein sequences in the same way GPT was trained on text are enabling capabilities that traditional bioinformatics tools could not. Meta AI's ESM-2, available in sizes from 8 million to 15 billion parameters, can predict protein function and structural properties from sequence alone without requiring multiple sequence alignment. The ESMFold model uses ESM-2 embeddings to predict 3D structure up to 60 times faster than AlphaFold 2 on short sequences, enabling rapid screening of millions of sequences. Salesforce's ProGen generates novel protein sequences conditioned on control tags such as protein family or function, essentially functioning as a protein designer steered by user-specified properties.
Personalised medicine is moving from concept toward practice. Personalised cancer vaccines, designed individually for each patient based on the specific mutation profile of their tumour, require AI to identify which mutated peptides will be recognised by the patient's immune system (neoantigen prediction) and to optimise the mRNA sequence encoding those peptides for stability and immune response. BioNTech and Moderna both have personalised cancer vaccine programmes in mid- to late-stage clinical trials that rely on AI for this process.
Digital cell twins represent the longer-term vision. Rather than predicting individual molecular properties, researchers are beginning to build computational models of entire cell signalling networks that can simulate how a cell responds to drug treatment. Recursion Pharmaceuticals has built a dataset of 50 million cell images from drug-treated cells and is training models that can predict cellular phenotypic response to novel compounds. If successful, this would allow researchers to ask not just whether a molecule binds its target but what it does to the cell as a whole system.
The 12-year drug development timeline will not collapse to 12 months. Clinical validation of safety and efficacy in humans will always require time. But AI is compressing the pre-clinical stages in ways that were not possible five years ago, and the first wave of AI-designed drugs entering Phase 2 trials will produce the human efficacy data that either validates or calibrates the field's current optimism.
Key Takeaways
- Drug discovery is fundamentally a prediction and search problem across vast chemical and biological spaces, exactly the kind of problem machine learning is suited to at scale.
- AlphaFold 2 solved protein structure prediction at experimental accuracy and made 200 million structure predictions publicly available; Hassabis and Jumper were awarded the 2024 Nobel Prize in Chemistry for this work.
- Generative chemistry approaches, including reinforcement learning and diffusion models guided by GNN property predictors, can now propose novel molecules optimised simultaneously for binding affinity, ADMET properties, and synthesisability.
- AI-designed drugs are in Phase 2 clinical trials as of 2026, including Insilico Medicine's ISM001-055, for which both the target and the molecule were identified with AI.
- The remaining bottleneck is clinical validation. AI accelerates discovery dramatically but cannot yet predict which molecules will succeed in human trials.
References
- Jumper, J., et al. (2021). Highly accurate protein structure prediction with AlphaFold. Nature, 596, 583–589.
- Stokes, J.M., et al. (2020). A Deep Learning Approach to Antibiotic Discovery. Cell, 180(4), 688–702.
- Zhavoronkov, A., et al. (2019). Deep learning enables rapid identification of potent DDR1 kinase inhibitors. Nature Biotechnology, 37, 1038–1040.
- Watson, J.L., et al. (2023). De novo design of protein structure and function with RFdiffusion. Nature, 620, 1089–1100.
- Abramson, J., et al. (2024). Accurate structure prediction of biomolecular interactions with AlphaFold 3. Nature, 630, 493–500.
- Bender, A., & Cortés-Ciriano, I. (2021). Artificial intelligence in drug discovery: what is realistic, what are illusions? Drug Discovery Today, 26(2), 511–524.