OpenAI Strikes $250M Data Licensing Deal with News Corp
The five-year pact secures premium journalistic archives from WSJ and Barron’s for AI training and inference.
OpenAI has formalized a landmark content licensing agreement with News Corp, a deal estimated to be worth more than $250 million (https://www.wsj.com/business/media/openai-news-corp-deal-250-million-4d642b5d) over five years. The partnership grants the Microsoft-backed AI company access to current and archived content from major publications including The Wall Street Journal, Barron’s, MarketWatch, and The Times, effectively turning premium journalistic output into a high-fidelity training stream for its next-generation world models. The move signals a strategic effort by OpenAI to insulate its data pipeline from the growing legal and ethical risks associated with unauthorized web scraping.
The Strategic Value of Premium Textual Assets
The agreement is not merely a defensive legal maneuver; it is a calculated bet on the superior performance of curated, high-authority datasets. As frontier models approach the limits of publicly available internet data, the industry is entering a phase of "data scarcity" in which the quality of tokens matters more than raw volume. By securing the News Corp archive, OpenAI gains access to decades of structured, fact-checked, and contextually rich human reasoning. This is critical for improving the factual accuracy and reasoning capabilities of models like GPT-5, which aim to function as more reliable agents in professional and financial environments. The deal also gives OpenAI the right to display content in response to user queries, further blurring the line between search engines and generative AI interfaces.
Scale AI and the $1B Infrastructure of Data Abundance
The institutional push for high-quality data is further evidenced by Scale AI’s recently closed $1 billion (https://techcrunch.com/2024/05/21/scale-ai-raises-1-billion-at-a-13-8-billion-valuation/) Series F funding round, which valued the company at $13.8 billion (https://www.reuters.com/technology/scale-ai-raises-1-billion-valuation-doubles-138-billion-2024-05-21/). Scale AI serves as a critical middleman in the data-asset economy, providing the human-in-the-loop (HITL) labeling and RLHF (Reinforcement Learning from Human Feedback) needed to turn raw data, such as the News Corp archives, into machine-ready training sets. This funding round, led by Accel with participation from sovereign wealth funds, underscores that the physical and human infrastructure required to process data is now as valuable as the compute power itself. As world models evolve to process multi-modal inputs (video, audio, and sensor data), labeling these assets becomes markedly more complex, creating a substantial moat for those who control the data supply chain.
DeepL and the Rise of Specialized Data Moats
While general-purpose models fight for news archives, specialized AI firms are proving the value of niche data assets. DeepL, the German translation AI specialist, recently secured $300 million (https://www.reuters.com/technology/ai-startup-deepl-valued-2-billion-after-latest-funding-round-2024-05-22/) in investment at a $2 billion (https://techcrunch.com/2024/05/22/deepl-the-ai-translation-startup-is-now-valued-at-2b/) valuation. DeepL’s success is built on a proprietary dataset of high-quality translations, which lets its models outperform larger ones trained on noisier data. This confirms a growing trend in the d-nvest intelligence space: owners of unique, industry-specific datasets (legal, medical, or linguistic) are seeing their asset valuations soar as generalist AI companies look to acquire specialized "knowledge moats" to differentiate their offerings.
Regulatory Guardrails: The EU AI Act Finalized
The market for data deals is now operating under a new regulatory benchmark. The Council of the European Union has given its final approval (https://www.consilium.europa.eu/en/press/press-releases/2024/05/21/artificial-intelligence-ai-act-council-gives-final-green-light-to-the-first-worldwide-rules-on-ai/) to the EU AI Act, the world’s first comprehensive framework for artificial intelligence. The regulation introduces transparency requirements for general-purpose AI models, including the obligation to publish a sufficiently detailed summary of the content used for training. This regulatory clarity is expected to accelerate the trend of formal licensing deals, as companies seek to avoid the fines and compliance exposure associated with poorly documented data sourcing. For data investors, the EU AI Act transforms data provenance from a legal footnote into a primary valuation driver.
Why it matters for data owners
For owners of high-quality, structured data assets, the OpenAI-News Corp deal is a watershed moment that establishes a clear market price for premium content. We are moving from an era of data exploitation to one of data monetization. As AI developers shift their focus toward "World Models" that require deep contextual understanding and factual grounding, the leverage shifts back to the content creators. Data owners should no longer view their archives as historical records, but as high-yield liquid assets that can be licensed repeatedly across different AI verticals. The key to maximizing value lies in data readiness: ensuring archives are digitized, metadata-rich, and legally cleared for AI training.
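To make the idea of data readiness concrete, the three criteria above can be expressed as a simple check over an archive inventory. The sketch below is illustrative only: the `ArchiveRecord` type, its field names, and the sample records are hypothetical and do not describe any real scoring method or dataset.

```python
from dataclasses import dataclass


@dataclass
class ArchiveRecord:
    """Hypothetical inventory entry for one archive item (illustrative fields)."""
    digitized: bool       # machine-readable text exists
    has_metadata: bool    # e.g. date, author, and section are populated
    rights_cleared: bool  # licensing for AI training has been confirmed


def readiness_share(records: list[ArchiveRecord]) -> float:
    """Return the fraction of records that meet all three readiness criteria."""
    if not records:
        return 0.0
    ready = sum(
        1 for r in records
        if r.digitized and r.has_metadata and r.rights_cleared
    )
    return ready / len(records)


# Made-up sample inventory: two of the four records pass every check.
archive = [
    ArchiveRecord(digitized=True, has_metadata=True, rights_cleared=True),
    ArchiveRecord(digitized=True, has_metadata=False, rights_cleared=True),
    ArchiveRecord(digitized=True, has_metadata=True, rights_cleared=False),
    ArchiveRecord(digitized=True, has_metadata=True, rights_cleared=True),
]
print(f"Ready for licensing: {readiness_share(archive):.0%}")
```

A real assessment would likely weight these criteria differently and add others, such as coverage period or exclusivity; the point here is only that readiness can be measured record by record rather than asserted for an archive as a whole.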
d-nvest turns the data assets behind these deals into scored, actionable opportunities.