Large genome model: Open source AI trained on trillions of bases
SMRTR summary
Just four months after releasing their first bacterial genome AI, researchers have unveiled Evo 2, an open-source system trained on 8.8 trillion DNA bases from bacteria, archaea, and eukaryotes. Unlike its predecessor, which worked with comparatively simple bacterial genomes, Evo 2 tackles the messy complexity of genomes like our own, where genes are interrupted by non-coding stretches (often called junk DNA) and regulatory sequences are scattered across vast distances.
The system learned to spot features that even specialized software struggles with, recognizing splice sites, regulatory DNA, and protein-coding regions without being explicitly trained on what to look for. When researchers changed single DNA bases and fed the altered sequences to Evo 2, it correctly identified which changes would disrupt crucial cellular functions.
Most remarkably, the AI developed an internal understanding of different species' genetic codes. As the researchers explain, "By learning the likelihood of sequences across vast evolutionary datasets, biological sequence models capture conserved sequence patterns that often reflect functional importance."
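To make the idea of scoring mutations by sequence likelihood concrete, here is a minimal sketch. It does not use Evo 2 or its interface, which this summary does not describe. As an assumption, a tiny Markov model stands in for the sequence model, and every name in it (train_markov, log_likelihood, score_substitution) is hypothetical. The score is the change in log-likelihood when one base is swapped: a strongly negative value means the mutant looks less like the patterns the model learned.

```python
import math
from collections import defaultdict

BASES = "ACGT"

def train_markov(sequences, k=2, pseudocount=1.0):
    """Toy stand-in for a sequence model: count which base follows each k-base context."""
    counts = defaultdict(lambda: dict.fromkeys(BASES, 0.0))
    for seq in sequences:
        for i in range(k, len(seq)):
            counts[seq[i - k:i]][seq[i]] += 1

    def log_prob(context, base):
        row = counts.get(context)
        if row is None:
            # Context never seen in training: fall back to a uniform guess.
            return math.log(1.0 / len(BASES))
        total = sum(row.values()) + pseudocount * len(BASES)
        return math.log((row[base] + pseudocount) / total)

    return log_prob

def log_likelihood(seq, log_prob, k=2):
    """Sum of log-probabilities of each base given the k bases before it."""
    return sum(log_prob(seq[i - k:i], seq[i]) for i in range(k, len(seq)))

def score_substitution(seq, pos, alt, log_prob, k=2):
    """Log-likelihood of the mutant minus that of the reference sequence."""
    mutant = seq[:pos] + alt + seq[pos + 1:]
    return log_likelihood(mutant, log_prob, k) - log_likelihood(seq, log_prob, k)

# Hypothetical training data: a strongly "conserved" repeating pattern.
model = train_markov(["ACGTACGTACGTACGT"] * 3)
reference = "ACGTACGTACGT"

# Swapping the C at index 5 for a T breaks the learned pattern.
delta = score_substitution(reference, 5, "T", model)
print(f"delta log-likelihood: {delta:.3f}")
```

A real genome model replaces the counting step with a neural network trained on far more data, but the comparison of mutant and reference likelihoods has the same shape.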
The big question remains whether Evo 2 has identified genome features that scientists don't even know exist yet.
SMRTR provides this summary for quick context. The original article belongs to Daily.dev.