Formation Bio Secures $372M Series D for AI-Data Drug Discovery
Led by a16z with participation from Sanofi, the round accelerates Formation Bio’s mission to automate drug development via proprietary data.
Formation Bio has closed a $372 million Series D funding round (disclosed) to scale its AI-native clinical trial platform and accelerate the acquisition of pharmaceutical data assets. The round was led by Andreessen Horowitz (a16z) with significant participation from the global healthcare giant Sanofi, signaling a major shift in how the industry values the intersection of proprietary clinical data and generative AI.
The Industrialization of Biological Data
Unlike traditional contract research organizations (CROs), Formation Bio operates as a tech-enabled pharmaceutical company that builds its own pipeline by acquiring clinical-stage drugs. The core of its strategy is a proprietary data engine that uses AI to automate trial design, patient recruitment, and data analysis. This specialized focus on biological data assets is mirrored by the recent launch of EvolutionaryScale, which raised $142 million (disclosed) to develop "biological LLMs" capable of designing new proteins. These deals underscore a broader market trend in which the value of a dataset lies not just in its volume, but in its ability to generate high-fidelity, actionable biological outcomes.
The involvement of Sanofi is particularly strategic. By integrating Formation Bio’s AI capabilities, the pharmaceutical giant aims to reduce the traditionally high failure rates of clinical trials. This partnership follows a pattern of major incumbents investing heavily in the data infrastructure of their disruptors to secure a seat at the table of the next generation of drug discovery. The capital will specifically be used to acquire new drug candidates and further refine the AI models that manage the massive influx of trial data.
Infrastructure and Retrieval Moats
The race to control the data pipeline is not limited to biotech. As AI models become commoditized, the focus has shifted toward the "data moat": the proprietary information and the infrastructure required to process it in real time. This was evident in OpenAI’s recent acquisition of Rockset (disclosed), a real-time analytics database company. By bringing Rockset’s technology in-house, OpenAI is strengthening its Retrieval-Augmented Generation (RAG) capabilities, allowing its models to interact more efficiently with enterprise data assets. Similarly, Apple and Meta have reportedly discussed a partnership (estimated) to integrate Meta’s Llama models into Apple Intelligence, a move that would pair Meta’s models with Apple’s vast ecosystem of user data.
The investment landscape remains aggressive for those building the foundational hardware to process these datasets. Etched secured $120 million in Series A funding (disclosed) to develop a specialized chip, Sohu, designed specifically to run transformer models. This hardware-level optimization is a direct response to the massive compute requirements of today's data-intensive AI applications.
The Regulatory Reckoning for Training Data
However, the rapid monetization of data assets is facing a significant legal challenge. The Recording Industry Association of America (RIAA), representing giants like Sony and Universal, has filed lawsuits against AI music startups Suno and Udio. The plaintiffs are seeking statutory damages of up to $150,000 per infringed work (estimated legal exposure), alleging that the companies used unlicensed copyrighted music to train their models. This litigation represents a pivotal moment for the data economy: if the courts rule that training on publicly available copyrighted works without a license is not "fair use," the cost of high-quality training sets will rise sharply, fundamentally altering the unit economics of AI development.
Why it matters for data owners
The Formation Bio round and the RIAA litigation represent two sides of the same coin for data owners. On one hand, specialized, high-integrity datasets in fields like biology and medicine are commanding massive premiums and driving nine-figure funding rounds. On the other, the era of "free" training data is coming to an end. For data asset owners, the message is clear: the market is moving toward a formal licensing and acquisition model. Whether you own clinical trial results, musical catalogs, or real-time enterprise data, your assets are now the primary bottleneck—and the primary value driver—in the global AI race.