EvolutionaryScale Secures $142M to Train AI on 2.8B Protein Sequences
Led by Lux Capital, the biological AI startup targets the drug discovery market with proprietary genomic datasets.
EvolutionaryScale has closed a $142 million seed round (https://techcrunch.com/2024/06/25/evolutionaryscale-is-building-a-chatgpt-for-biology-with-142m-from-nat-friedman-and-lux-capital/) to commercialize ESM3, a generative AI model trained on 2.78 billion protein sequences (https://www.forbes.com/sites/alexkonrad/2024/06/25/evolutionaryscale-raises-142-million-for-biological-ai/). The round was led by Lux Capital, Nat Friedman, and Daniel Gross, with participation from Amazon and NVentures (Nvidia’s venture arm). It signals a shift in the data-asset market from general-purpose LLMs toward specialized, high-fidelity biological data models. At 98 billion parameters (https://www.forbes.com/sites/alexkonrad/2024/06/25/evolutionaryscale-raises-142-million-for-biological-ai/), ESM3 is one of the largest applications of scientific data in the AI era, and the company says it can simulate 500 million years of evolution to design new proteins.
The Biological Data Frontier
Where the first wave of generative AI was powered by text-heavy datasets, EvolutionaryScale’s value proposition rests on the curation and processing of genomic and proteomic data. By training on billions of sequences, the company is effectively creating a "programmable biology" layer. This move underscores the premium now placed on structured scientific data, which is far scarcer and more difficult to ingest than public web text. The involvement of Amazon and Nvidia (https://techcrunch.com/2024/06/25/evolutionaryscale-is-building-a-chatgpt-for-biology-with-142m-from-nat-friedman-and-lux-capital/) suggests that infrastructure providers are eager to secure a foothold in the biological data pipeline, which could reshape R&D across the more than $1 trillion pharmaceutical sector.
OpenAI’s Strategic Data Acquisition
The quest for data efficiency is not limited to biology. OpenAI recently announced its acquisition of Rockset (https://openai.com/index/openai-to-acquire-rockset/), a real-time search and analytics database company. The acquisition is a tactical move to bolster OpenAI’s retrieval-augmented generation (RAG) capabilities. By integrating Rockset’s technology, OpenAI can more effectively index and query the datasets provided by its enterprise partners, turning static data repositories into information a model can retrieve at query time. The deal highlights the growing importance of the "data-to-model" interface: the software layer that determines how efficiently an AI can access and reason over proprietary enterprise assets.
The Clinical Data Land Grab
Further emphasizing the value of specialized data, HEALWELL AI has entered into a definitive agreement to acquire BioPharma Services for approximately $11.5 million (https://www.globenewswire.com/news-release/2024/06/24/2903058/0/en/HEALWELL-AI-to-Acquire-BioPharma-Services-a-Leading-Full-Service-Contract-Research-Organization.html). BioPharma Services is a full-service Contract Research Organization (CRO) with deep clinical trial data assets. For HEALWELL, this is not just an expansion of services but a strategic acquisition of a data pipeline. Access to high-quality clinical trial data is a primary bottleneck for AI-driven drug discovery and personalized medicine, and acquiring a CRO provides a direct, proprietary source of the "ground truth" data required to train diagnostic and therapeutic models.
Regulatory Walls and Data Portability
As the value of data assets climbs, regulators are moving to ensure that this value is not locked behind the "walled gardens" of Big Tech. The European Commission recently issued preliminary findings that Apple is in breach of the Digital Markets Act (DMA) (https://ec.europa.eu/commission/presscorner/detail/en/ip_24_3433). The findings center on Apple’s anti-steering rules, which prevent developers from freely directing consumers to alternative channels for offers and content. This regulatory pressure is part of a broader global trend aimed at enforcing data portability and interoperability. For data investors, these rulings are critical: they signal a future in which control over user data, and the ability to monetize it through secondary licensing, will be subject to intense antitrust scrutiny.
Why It Matters for Data Owners
The EvolutionaryScale and HEALWELL AI deals demonstrate that the most lucrative data assets are no longer found on the "open web" but in specialized, high-moat domains like genomics and clinical medicine. For data owners, the lesson is clear: the market is moving away from bulk data licensing toward high-precision, structured datasets that can be directly ingested by specialized AI architectures. Whether it is protein sequences or real-time enterprise data, the value lies in the data’s ability to solve specific, high-value problems that general-purpose models cannot touch. Monetization strategies should focus on data cleanliness, regulatory compliance, and the ability to integrate with the latest RAG and generative architectures.