An AI model that can decode and design living organisms
The Arc Institute's new Evo 2 model does for DNA what ChatGPT did for words
Last week the Arc Institute, in collaboration with NVIDIA and researchers from Stanford, UC Berkeley, and UC San Francisco, published a frontier AI model that is capable of generating plausible and seemingly functional DNA sequences, much as mainstream generative AI platforms generate text or code. And while it’s still early days for general-purpose AI models that speak the language of DNA, the implications of the new model are profound.
On one level, the emergence of generative AI models that “speak” DNA rather than natural language, or computer code, seems inevitable. After all, DNA as it appears in biological organisms is just another type of language that connects a sequence of symbols with functional outcomes.
Yet the translation of DNA sequences into outcomes — such as protein synthesis and cellular functions, all the way up to influencing what organisms look like and how they behave — is fiendishly complex. And as a result, it’s largely remained out of reach of large language models that excel with text — until now.
Evo 2, the new model from the Arc Institute, builds on previous work on AI genome models. But the sheer scale of its training set and context window (the amount of information it can work with) places it in a new category of model.
Like a more conventional generative AI model, Evo 2 is, in the words of the just-released paper that describes it, “fundamentally a generative model trained to predict the next base pair in a sequence.”
In other words, just as ChatGPT, DeepSeek, or other models predict the most likely words, sentences and paragraphs that follow the prompt you give them, Evo 2 predicts the most likely sequence of DNA base pairs that follow a DNA “prompt.”
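To make the idea of next-base prediction concrete, here is a deliberately tiny sketch. It is not how Evo 2 works internally (the real model is a large neural network trained on trillions of base pairs); it simply counts which base tends to follow each short stretch of DNA in some toy training data, then extends a prompt one base at a time. The function names, the context length `k`, and the training sequence are all made up for illustration.

```python
from collections import Counter, defaultdict


def train_next_base_model(sequences, k=3):
    """Count which base follows each k-base context in the training sequences."""
    counts = defaultdict(Counter)
    for seq in sequences:
        for i in range(len(seq) - k):
            context = seq[i:i + k]
            next_base = seq[i + k]
            counts[context][next_base] += 1
    return counts


def predict_next_base(counts, prompt, k=3):
    """Return the base seen most often after the prompt's last k bases, or None."""
    context = prompt[-k:]
    if context not in counts:
        return None  # this toy model has never seen the context
    return counts[context].most_common(1)[0][0]


def complete(counts, prompt, length, k=3):
    """Extend the prompt one predicted base at a time (greedy decoding)."""
    seq = prompt
    for _ in range(length):
        next_base = predict_next_base(counts, seq, k)
        if next_base is None:
            break
        seq += next_base
    return seq


# Toy training data (hypothetical): a short repeating sequence.
model = train_next_base_model(["ACGTACGTACGT"])
print(complete(model, "ACG", 5))  # prints ACGTACGT
```

The loop in `complete` is the same basic recipe a text model follows: predict the next symbol, append it, repeat. What differs in a model like Evo 2 is the sophistication of the predictor and how much surrounding sequence it can take into account.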
But while a text-based large language model is trained on billions of pages of (mainly) human-generated text, Evo 2 is trained on trillions of DNA base pairs (9.3 trillion, to be precise) spanning over 128,000 complete genomes covering bacteria, archaea, phages, plants, and other single-celled and multicellular species, including humans.
As a result, Evo 2 has access to an incredibly detailed “book of DNA” (or book of life if you’re feeling poetic) that allows it — without additional explicit training — to predict functional DNA sequences with uncanny accuracy.
It does this by drawing on similarities across species to generate probabilistically reasonable new sequences. It is, in effect, a DNA-based stochastic parrot, except that its ability to “parrot” biology far exceeds anything humans are capable of on their own.
As a simple example of this, researchers using the model are able to feed it the beginning of a DNA sequence that is known to be associated with a specific gene, and Evo 2 can generate the rest of the sequence, including variants. Do the same with a simple single-celled organism, and it can construct the entire genome.
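One way to picture how a single prompt can yield several variants is to sample the next base in proportion to how likely it is, rather than always taking the top choice. The sketch below does this with a toy counting model; it is an illustration under made-up assumptions (the three "genes", the context length `k`, and the function names are all hypothetical), not Evo 2's actual generation procedure.

```python
import random
from collections import Counter, defaultdict


def train(sequences, k=2):
    """Count which base follows each k-base context."""
    counts = defaultdict(Counter)
    for seq in sequences:
        for i in range(len(seq) - k):
            counts[seq[i:i + k]][seq[i + k]] += 1
    return counts


def sample_completion(counts, prompt, length, k=2, rng=None):
    """Extend the prompt by sampling each next base in proportion to its count."""
    rng = rng or random.Random()
    seq = prompt
    for _ in range(length):
        options = counts.get(seq[-k:])
        if not options:
            break  # unseen context: stop early
        bases = list(options.keys())
        weights = list(options.values())
        seq += rng.choices(bases, weights=weights, k=1)[0]
    return seq


# Three toy "genes" (hypothetical) that differ at a single position.
toy_genes = ["ATGGCATAA", "ATGGCTTAA", "ATGGCGTAA"]
model = train(toy_genes)

rng = random.Random(0)  # seeded so repeated runs give the same samples
variants = {sample_completion(model, "ATG", 6, rng=rng) for _ in range(20)}
print(sorted(variants))
```

Because the model only knows local statistics, it can also emit a sequence such as `ATGGCATGG` that appears nowhere in its training data but is stitched together from patterns that do. That is a very small-scale version of generating something new that still follows the learned "rules."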
But it can go beyond this and generate completely novel sequences that, while new, still follow the “rules of life” that it’s “learned” from its training set — leading to the potential creation of new functional genes and gene clusters, and even the complete genomes of organisms that have never previously existed.
And importantly — because I suspect this will be one of the first questions people ask — this is without Evo 2 being explicitly trained on associations between genes and phenotypes (the observable characteristics of an organism). In other words, the model is interpreting and interpolating what reasonable DNA sequences look like within a given context, without an understanding of how these correlate with physical characteristics and behaviors (much like an advanced text-based AI model can generate beautiful text that emulates understanding, but which isn’t actually grounded in understanding).
Of course, there’s a large gap between a whole genome sitting on a computer, and what it takes to translate this into a living organism. For one thing, you need an existing cell structure to insert the synthesized DNA into.
But the technology already exists for doing this — as J. Craig Venter and his team demonstrated by creating the first synthetic organism back in 2010.
As Evo 2 is only a few days old at this point, it’s too early to say how researchers will use the model. But there’s already considerable excitement around how a generalized genomic AI model could revolutionize everything from understanding gene-based diseases and opening up new possibilities in gene-based precision medicine, to radically expanding what is possible with gene editing.
What is clear is that, just as text-based AI models are transforming what is possible in a world built around the spoken and written word (or computer code), DNA-based models are likely to be equally transformative in the biomedical sciences and beyond.
Imagine, for instance, that these emerging models are able to identify novel approaches to precision gene editing.
Or imagine that they open up ways of encoding biological DNA to carry out functions that have never been considered before. DNA is, after all, a biological code that defines functional molecular machines, and who knows what molecular machines we could conceive of and build using advanced DNA-based AI models.[1]
And imagine that these new AI models lead to new ways of interfacing biological and non-biological systems by coding DNA in ways that allow cells to successfully fuse with machines — in brain computer interfaces say, or by allowing nanoscale machines to be integrated with biological systems.
And pushing this out of the domain of living organisms, imagine that they open up the ability to use DNA as the basis of functional materials with a degree of sophistication that we haven’t previously been able to achieve. We’ve already seen research into utilizing the spatial and structural properties of DNA with “DNA origami,” and using DNA as an information storage system — and even a compute substrate. Imagine what might be possible with AI-enhanced DNA coding.
Finally, imagine being able to use AI to design and create hybrid biological and mechanical systems by simultaneously coding in DNA on the bio side, and atoms and molecules on the non-bio material side.
Speculative as this is, it’s not beyond plausibility — and it would open the door to everything from sophisticated organoid-machine hybrids such as “brains on a chip” to seamless integration between biology and machines at the cellular, organ and whole organism level.
Admittedly this is all beginning to feel a little sci-fi, and it is pushing the bounds of imagination. But given that it’s taken just over two years to go from the early iterations of ChatGPT to AI systems that can simulate reasoning, carry out complex research, and accelerate discovery, I don’t think it’s unreasonable to assume that transformative frontier AI models that think and create using the language of DNA are closer than we might think.
Of course, as well as opening up new opportunities, this does raise the possibility of quite profound risks if we get things wrong.
Despite ideas like responsible AI seemingly going out of fashion at lightning speed at the moment while concepts like “permissionless innovation” take their place,[2] the team behind Evo 2 were cognizant of the potential unintended consequences of their work.
Rather smartly, they intentionally did not include the genomes of pathogenic viruses in their training data, to reduce the chances of the model spitting out the blueprints for new or modified deadly viruses. They went on to “red team” the model to make sure its ability to generate sequences for pathogenic human viral proteins had been hobbled. It had.
But I suspect that the domain of unexpected consequences from this emerging technology for human and environmental health and wellbeing goes way beyond harmful viruses. And it would be good to see teams bringing in experts with deep knowledge and a cross-disciplinary understanding of how to successfully navigate highly disruptive and deeply complex technology transitions.
Somewhat ironically, the scientific community were facing a similar challenge almost 50 years ago to the day, but on what now looks like a much smaller scale. Between February 24 and 27, 1975, a group of scientists met at the Asilomar Conference Center in California to grapple with the potential dangers of research into recombinant DNA, along with their collective responsibility to ensure the technology’s safe and ethical development and use. The meeting was a landmark in establishing the foundations of responsible and beneficial genetic manipulation.
This coming week, the 50th anniversary of that meeting is being marked by another gathering of researchers at Asilomar — this one to grapple with today’s advances in gene-based technologies.
The Spirit of Asilomar meeting — which coincides to the day with the 50th anniversary of that original meeting — will be covering everything from pathogens research and bioweapons, to synthetic cells and genetic containment.
And one of those topics is artificial intelligence and biology.
I would make a shrewd guess that Evo 2 will be a topic of heated conversation at the meeting — and hopefully one that leads to new thinking around how this emerging capacity for AI to code in the language of DNA can be steered toward positive outcomes.
Especially in a world where the name of the AI game is increasingly to go fast and break things in the hope that someone else will clean up the mess.
[1] I’ve long been intrigued by the parallels between coding using DNA and coding using digital ones and zeroes. Innovative coders have used digital platforms to be incredibly creative, creating everything from elegant but devastating computer viruses to the foundations of blockchain. Imagine what might be possible with AI-enabled coding using the four bases of DNA.
[2] I was intending to write about the upsurge in adoption of the idea of permissionless innovation this week, but the release of Evo 2 scuppered that plan. Hopefully next week …