Introducing Chroma
Chroma is a generative model that creates new protein molecules based on geometric and functional programming instructions.
Chroma learns patterns in the three-dimensional structures and amino acid sequences of proteins and protein complexes from the Protein Data Bank. By learning these patterns in a way that generalizes across natural proteins, Chroma can synthesize new protein molecules that adhere to these principles while combining them in novel ways. Importantly, Chroma can be conditioned on a set of desired structural or functional properties, such as the presence of functional structural motifs, symmetry constraints, adhering to a pre-specified shape, belonging to a domain or functional class, or even satisfying text-based descriptions. We think that systems such as Chroma will enable a new, programmable mode of protein engineering where it is routine and feasible to generate specific and tailored protein solutions to complex challenges for bioengineering and human health.
The Chroma system is built on several new machine learning components, including a new neural network architecture for processing and manipulating 3D molecular information, a new diffusion process for adding noise to structures while adhering to the biophysical constraints of protein chains, and a new generalized method for generating high-quality samples from diffusion models. As a result of these innovations, Chroma is able to generate extremely large proteins and protein complexes (e.g. 30,000+ heavy atoms across 4,000+ residues) in a few minutes on a single commodity GPU.
![](https://generate-biomedicines-dev.imgix.net/assets/figure2_protein_explainer_dark.jpg?auto=format&crop=focalpoint&domain=generate-biomedicines-dev.imgix.net&fit=crop&fp-x=0.5&fp-y=0.5&h=800&ixlib=php-3.3.1&q=82&w=800&s=188b99a9bbe41791f8ff169761559761)
What is a protein?
Proteins are the “doers” of cells, responsible for much of the work that happens in the biological world.
Proteins are formed from a linked sequence of building blocks known as amino acids. Because there are 20 unique “letters” in the amino-acid alphabet, an unimaginably large number of unique proteins is possible, each encoded by a specific sequence of amino acids. While it’s easy to think of proteins as strings of letters, in the cell they fold into 3‑dimensional shapes that perform specific biological functions. A partial list of the feats performed by proteins includes replicating DNA, fighting off invading viruses, and keeping your cells alive and healthy. Even things you might not think of as protein functions, like our sense of smell or sight, are in fact enabled by proteins! Given how capable this molecular class is, it is not surprising that many of the medicines approved today are proteins. Thus, the better we understand how proteins work and the more effectively we can generate new ones for targeted functions, the more effective the medicines of tomorrow will be.
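To put a number on "unimaginably large", here is a small back-of-the-envelope sketch. It only assumes the 20-letter alphabet mentioned above; the function names are ours, chosen for illustration.

```python
ALPHABET_SIZE = 20  # the 20 standard amino acids


def num_sequences(length: int) -> int:
    """Number of distinct amino-acid sequences of a given length."""
    return ALPHABET_SIZE ** length


def order_of_magnitude(length: int) -> int:
    """Power of ten of 20**length (number of decimal digits minus one)."""
    return len(str(num_sequences(length))) - 1


for n in (10, 100, 300):
    print(f"length {n}: about 10^{order_of_magnitude(n)} possible sequences")
```

Even a modest 100-residue protein has on the order of 10^130 possible sequences, far more than could ever be enumerated or tested one by one.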
Sampling from the protein universe
Having learned common principles from natural proteins, Chroma can generate random proteins without any additional prompting.
Below is a small set of single-chain proteins sampled from the model.
![](https://cdn.generatebiomedicines.com/assets/single_chain_collage.jpg)
Sampling protein complexes
Many proteins carry out their functions through interactions with other proteins in multi-molecule assemblies called complexes.
These complexes can transmit information, catalyze important chemical reactions, or act as complex molecular machines and are frequently the targets of therapies. Chroma can directly generate protein complexes composed of many proteins of different shapes. Below are a few examples of complexes generated by Chroma.
![](https://cdn.generatebiomedicines.com/assets/multichain_collage.jpg)
Making giants
Typical proteins are composed of tens to thousands of amino acids. Chroma’s efficient computational scaling makes it possible to directly generate large molecular assemblies at the scales frequently seen in nature.
![](https://cdn.generatebiomedicines.com/assets/big_complexes.jpg)
Programming proteins with Chroma
Chroma can incorporate a wide range of properties and constraints to steer the generative process. We imagine that capabilities like this will enable a future in which scientists can specify desired protein functions or properties in a high-level language and let Chroma compile these attributes into a lower-level “executable” version of the molecular function in the form of a 3D protein molecule.
In software engineering, the transition from low-level machine and assembly code to high-level languages such as C++, Java, and Python sparked an intense period of innovation in which developers could build complicated programs from simpler, reliable abstractions. In a similar way, we believe that generative models such as Chroma offer a step towards a higher-level programming abstraction for biology. Below we show an early version of what we imagine this could look like.
Symmetry groups
Many protein complexes in nature are built from symmetrical “tilings” of one or more protein building blocks.
Chroma can be conditioned on many different kinds of symmetry, from simple circular symmetries to the complex icosahedral symmetries often seen in nanoparticles.
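To make the idea of a circular (cyclic) symmetry concrete, the sketch below builds a toy ring by rotating one subunit about an axis. It illustrates what C_n symmetry means geometrically and is not a description of how Chroma applies symmetry constraints; the coordinates and function names are made up for the example.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float, float]


def rotate_z(point: Point, angle: float) -> Point:
    """Rotate a 3D point about the z axis by `angle` radians."""
    x, y, z = point
    c, s = math.cos(angle), math.sin(angle)
    return (c * x - s * y, s * x + c * y, z)


def cyclic_assembly(subunit: List[Point], n: int) -> List[List[Point]]:
    """Return n copies of `subunit` related by C_n symmetry about the z axis."""
    step = 2.0 * math.pi / n
    return [[rotate_z(p, step * k) for p in subunit] for k in range(n)]


# Toy subunit: three pseudo-atoms placed away from the symmetry axis.
subunit = [(10.0, 0.0, 0.0), (11.5, 1.0, 0.5), (13.0, 0.0, 1.0)]
ring = cyclic_assembly(subunit, 6)
print(len(ring), "subunits,", sum(len(copy) for copy in ring), "pseudo-atoms in total")
```

Rotating the whole ring by one step (60 degrees here) maps every subunit onto its neighbor, which is the defining property of the symmetry. Icosahedral assemblies follow the same principle with a larger set of rotations.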
Protein infilling
A routine challenge in protein engineering is to keep one part of a protein molecule fixed because it is important for a given property (such as folding or interacting with the immune system) while changing another part of the protein (e.g. to bind to a target).
In the example below an antibody heavy chain CDR region is being reimagined by the model. The top left image shows a true structure of an antibody-protein complex. In the remaining 3 images, the loops are removed and re-generated from scratch.
![](https://cdn.generatebiomedicines.com/assets/Infilling.jpg)
Semantic conditioning
Chroma can be conditioned by other neural networks that know about proteins without retraining. This makes it possible to control higher-level properties of proteins, i.e. their semantics.
In the example below, we bias Chroma sampling with two different neural networks: one trained to predict CATH folds from structure, and one trained to predict natural language captions from protein 3D structures. Each column represents a particular conditioning example. The leftmost two columns are conditioned with the CATH classification model. The rightmost two columns are conditioned with the protein captioning model. The top row shows random samples that the model generated without conditioning. The middle row shows samples drawn from Chroma with conditioning. Finally, the bottom row shows a real example belonging to the desired class or caption. We can see how the classifier drives the conditioned samples (middle row) to look more similar to real examples (bottom row) than the unconditioned samples (top row). Capabilities such as this should make high-level functional programming more routinely feasible.
![](https://cdn.generatebiomedicines.com/assets/semantic_conditioning_fig.png)
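One standard way to write down this kind of conditioning for diffusion models is classifier guidance, which follows from Bayes' rule. We give it here as an illustrative sketch rather than as the exact formulation used in Chroma. The symbols are our own: x_t is a noised structure at diffusion time t, y is the desired property (a CATH class or a caption), p_t(x_t) is the unconditional generative model, and p_t(y | x_t) is the external classifier.

```latex
\begin{align}
p_t(x_t \mid y) &\propto p_t(x_t)\, p_t(y \mid x_t) \\
\nabla_{x_t} \log p_t(x_t \mid y) &= \nabla_{x_t} \log p_t(x_t) + \nabla_{x_t} \log p_t(y \mid x_t)
\end{align}
```

The second line is why no retraining is needed: the classifier contributes only an additive gradient term during sampling, while the generative model itself is left unchanged.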
Shape control
What are the limits on protein shape? We can also ask Chroma to sample 3D structures given arbitrary shape specifications. Below we asked for proteins consistent with the Latin alphabet and numeral system.
Transforming between protein structures
Since Chroma parameterizes the space of possible 3D protein structures in a continuous way, we can ask what it thinks is in between two different structures.
These morphs shed light on how Chroma organizes structure space in its internal representations.
![](https://cdn.generatebiomedicines.com/assets/interp1_640.gif)
Morphing between secondary structures
We morph between a highly alpha-helical protein (a pheromone from the marine ciliate Euplotes raikovi, PDB 6E6N) and a beta-rich protein (a toxin from the scorpion Mesobuthus martensii, PDB 6AY8).
![](https://cdn.generatebiomedicines.com/assets/interp2_640.gif)
A bigger alpha to beta morph
This morph transforms between rhodopsin, a transmembrane protein found in rod cells in the eye (PDB 2I35), and a six-bladed beta propeller structure (PDB 3DAS).
![](https://cdn.generatebiomedicines.com/assets/interp3_640.gif)
Parallel-to-anti-parallel transition
Coiled coils are a common structural motif involving alpha-helices that wind around one another. This morph shows a transition between a parallel (PDB 2ZTA) and an anti-parallel coiled coil (PDB 1HF9).
![](https://cdn.generatebiomedicines.com/assets/interp4_450.gif)
Conformational shifts in a transporter
The leucine transporter (LeuT) helps move the amino acid leucine across cell membranes via a dynamic conformational change of opening and closing. Here we interpolate between three experimentally determined conformations: the outward-open state (PDB 3TT1), the occluded state (PDB 3F3E), and the inward-open state (PDB 3TT3).
![](https://cdn.generatebiomedicines.com/assets/interp5_7_450.gif)
Viral fusion-enabling conformational shifts
Hemagglutinins are spike-shaped proteins on the surfaces of viruses such as influenza that drive fusion of the virus with the host cell. Here we morph between three different functional conformational states of influenza hemagglutinin during this fusion process: state I (PDB 6Y5H), state II (PDB 6Y5I), and state IV (PDB 6Y5K). We can see the interior alpha helices retract to drive fusion with the host membrane.
The Details
![](https://cdn.generatebiomedicines.com/assets/fig1_schematic_v1.jpg)
This model was made possible by several novel components:
- Programmable protein generation. A collection of new conditioning models that steer the generative process.
- Random Graph Neural Networks. A novel neural network architecture that can process and modify 3D molecular systems in time that scales sub-quadratically with the size of the whole system.
- Polymer diffusion. A novel diffusion process that respects the biophysical constraints of proteins as collapsed polymers.
- Low-temperature sampling. A novel sampling algorithm for generating high-likelihood samples from diffusion models.
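To give some intuition for how a random graph can avoid quadratic cost, here is a minimal sketch. It is not the architecture from the paper: the neighbor rule (a few nearest neighbors plus a few uniformly random long-range ones) and the parameters `k_near` and `k_random` are assumptions made for illustration. The neighbor search below is brute force and therefore itself quadratic; the point is only that the resulting graph has a number of edges that grows linearly with the number of residues, so message passing over it does not touch every residue pair.

```python
import math
import random
from typing import Dict, List, Set, Tuple

Point = Tuple[float, float, float]


def sparse_random_graph(
    coords: List[Point], k_near: int, k_random: int, seed: int = 0
) -> Dict[int, Set[int]]:
    """Connect each node to its k_near closest nodes plus k_random random others."""
    rng = random.Random(seed)
    n = len(coords)
    graph: Dict[int, Set[int]] = {}
    for i in range(n):
        others = [j for j in range(n) if j != i]
        # Brute-force sort by distance (fine for a toy example).
        others.sort(key=lambda j: math.dist(coords[i], coords[j]))
        near = others[:k_near]
        far_pool = others[k_near:]
        # Random long-range edges let information travel across the structure.
        far = rng.sample(far_pool, min(k_random, len(far_pool)))
        graph[i] = set(near) | set(far)
    return graph


# Toy "structure": 200 random points in a 50 x 50 x 50 box.
point_rng = random.Random(1)
coords = [tuple(point_rng.uniform(0.0, 50.0) for _ in range(3)) for _ in range(200)]
graph = sparse_random_graph(coords, k_near=8, k_random=8)

num_edges = sum(len(neighbors) for neighbors in graph.values())
all_pairs = len(coords) * (len(coords) - 1)
print(f"{num_edges} directed edges instead of {all_pairs} for the full graph")
```

With a fixed number of neighbors per residue, doubling the size of the system doubles the number of edges, whereas the fully connected graph would quadruple.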
![](https://generate-biomedicines-dev.imgix.net/assets/Blog_All_boxplot_2022-11-29-160928_fxdu.png?auto=format&crop=focalpoint&domain=generate-biomedicines-dev.imgix.net&fit=crop&fp-x=0.5&fp-y=0.5&h=807&ixlib=php-3.3.1&q=82&w=800&s=a6816850ed91d08e67be4ff7568bc529)
Structural validation
Across a set of 10,000 samples of single-chain structures, we find that Chroma reproduces the structural statistics of proteins from the PDB.
Designability
Chroma generates protein molecules by first synthesizing a backbone structure and then designing sequences consistent with that 3D backbone. While the only true way to test the validity of these designs is experimental characterization, we can begin to check self-consistency by asking an orthogonal in silico structure prediction method whether it predicts that each designed sequence folds back into its intended shape.
We find frequent agreement, even for reasonably sized molecules shown below. Of course, the true test of protein design is to make and test molecules in the lab, which is why machine learning is only the first step of what we do at Generate.
![](https://generate-biomedicines-dev.imgix.net/assets/Blog_coverage_not_length_normalized_2022-11-29-161127_pepw.png?auto=format&crop=focalpoint&domain=generate-biomedicines-dev.imgix.net&fit=crop&fp-x=0.5&fp-y=0.5&h=776&ixlib=php-3.3.1&q=82&w=800&s=99b1ac541c89f0f0c50876fe68c5ba5e)
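For the self-consistency check described under Designability, agreement between a designed backbone and a predicted structure is commonly quantified with a TM-score, where values above 0.5 are usually read as "same fold". The sketch below shows the scoring formula only. It assumes the two structures have the same length, are paired residue by residue, and have already been superimposed; a real TM-score additionally searches for the best superposition. The floor of 0.5 Å on the distance scale for very short chains is a choice made for this sketch, and whether this is the exact metric behind the results above is our assumption.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float, float]


def tm_d0(length: int) -> float:
    """Length-dependent distance scale d0 (in angstroms) used by the TM-score."""
    if length <= 21:
        return 0.5  # floor for short chains, chosen for this sketch
    return max(1.24 * (length - 15) ** (1.0 / 3.0) - 1.8, 0.5)


def tm_score_superimposed(designed: List[Point], predicted: List[Point]) -> float:
    """TM-score formula for two equal-length, already superimposed coordinate lists."""
    if not designed or len(designed) != len(predicted):
        raise ValueError("structures must be non-empty and of equal length")
    d0 = tm_d0(len(designed))
    total = 0.0
    for a, b in zip(designed, predicted):
        d = math.dist(a, b)  # per-residue deviation in angstroms
        total += 1.0 / (1.0 + (d / d0) ** 2)
    return total / len(designed)


# Toy example: a straight 100-residue trace and a copy displaced by 1 angstrom.
designed = [(3.8 * i, 0.0, 0.0) for i in range(100)]
predicted = [(x, y + 1.0, z) for x, y, z in designed]
print(f"TM-score: {tm_score_superimposed(designed, predicted):.3f}")
```

Because each residue contributes a value between 0 and 1 that decays smoothly with distance, a few badly placed loops lower the score only slightly, whereas a wrong overall fold drives it towards zero.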
Novelty
Natural proteins tend to be composed of well-defined, conserved structural domains ranging between 50 and 200 residues in length. We assess the novelty of Chroma-generated structures by computing the number of common protein domains (CATH) needed to cover at least 80% of each generated structure above a structural similarity cutoff (TM > 0.5). By this measure, Chroma proteins appear to demonstrate greater novelty than a collection of proteins from the PDB, despite being trained on proteins from the PDB. This could be a sign that the model has learned the principles required for generalization rather than overfitting to or memorizing the proteins it has already seen.
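The coverage measure can be sketched as a set-cover problem. The code below is a simplified illustration, not the procedure used for the figure: it assumes each domain hit has already passed the TM > 0.5 cutoff and is represented by the set of residue indices it aligns to, and it uses a greedy heuristic, which approximates but does not guarantee the minimum number of domains. All names are hypothetical.

```python
from typing import List, Optional, Set


def domains_needed(
    length: int, hits: List[Set[int]], target_percent: int = 80
) -> Optional[int]:
    """Greedy count of domain hits needed to cover target_percent of residues.

    Returns None if the hits cannot reach the target coverage.
    """
    covered: Set[int] = set()
    remaining = [set(hit) for hit in hits]
    count = 0
    # Integer arithmetic avoids floating-point edge cases at exactly 80%.
    while 100 * len(covered) < target_percent * length:
        best = max(remaining, key=lambda hit: len(hit - covered), default=None)
        if best is None or not (best - covered):
            return None  # no remaining hit adds new residues
        covered |= best
        remaining.remove(best)
        count += 1
    return count


# Toy 100-residue structure with four overlapping domain hits.
hits = [set(range(0, 50)), set(range(40, 90)), set(range(85, 100)), set(range(0, 10))]
print(domains_needed(100, hits))
```

Under this measure, a structure that needs many domains, or that cannot be covered at all, is less like a simple recombination of known building blocks.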
Limitations
![](https://cdn.generatebiomedicines.com/assets/bloopers.png)
Examples of generated proteins that did not work well
Above are some examples of generated proteins that may illustrate some potential bugs and failure modes. (a) In some cases the conditioners result in protein samples that are not connected or are very sparsely connected. (b) Unconditioned samples can exhibit rare but significant pathologies such as clashes, poor topologies, and tangles. Large unconditioned samples (1000+ residues) sometimes have extended regions with low secondary structure content.
Some limitations are:
- The real test of any protein is wet lab synthesis and experimentation, and so far these are only in silico predictions.
- Conditional models can be difficult to tune and still tend to require expert supervision and collaboration with experienced protein designers for troubleshooting.
- Low-temperature sampling can shift macroscopic observables, such as the balance of alpha and beta secondary structure content, in ways that require deeper understanding.
Read the paper
Illuminating protein space with a programmable generative model
John Ingraham, Max Baranov, Zak Costello, Vincent Frappier, Ahmed Ismail, Shan Tie, Wujie Wang, Vincent Xue, Fritz Obermeyer, Andrew Beam, Gevorg Grigoryan
Three billion years of evolution have produced a tremendous diversity of protein molecules, and yet the full potential of this molecular class is likely far greater. Accessing this potential has been challenging for computation and experiment because the space of possible protein molecules is much larger than the space of those likely to host function. Here we introduce Chroma, a generative model for proteins and protein complexes that can directly sample novel protein structures and sequences and that can be conditioned to steer the generative process towards desired properties and functions. To enable this, we introduce a diffusion process that respects the conformational statistics of polymer ensembles, an efficient neural architecture for molecular systems based on random graph neural networks that enables long-range reasoning with sub-quadratic scaling, equivariant layers for efficiently synthesizing 3D structures of proteins from predicted inter-residue geometries, and a general low-temperature sampling algorithm for diffusion models. We suggest that Chroma can effectively realize protein design as Bayesian inference under external constraints, which can involve symmetries, substructure, shape, semantics, and even natural language prompts. With this unified approach, we hope to accelerate the prospect of programming protein matter for human health, materials science, and synthetic biology.
We would like to thank William F. DeGrado and Generate employees Adam Root, Alan Leung, Alex Ramos, Brett Hannigan, Eugene Palovcak, Frank Poelwijk, James Lucas, James McFarland, Karl Barber, Kristen Hopson, Martin Jankowiak, Mike Nally, Molly Gibson, Ross Federman, Stephen DeCamp, Thomas Linsky, Yue Liu, and Zander Harteveld for reading the manuscript draft and providing helpful comments.