The AI industry faces a looming data shortage: companies have already exhausted much of the available training data, and startups are now exploring ways around the problem.
Synthetic data emerges as a potential solution: Gretel, a startup valued at $350 million, is creating AI-generated synthetic data that closely mimics real information without the privacy concerns:
- Synthetic data has been used by companies working with sensitive information, such as patient data, to protect privacy while still providing valuable training data for AI models.
- Gretel’s CEO, Ali Golshan, sees an opportunity to supply data-starved AI companies with fake data made from scratch to train their models, an approach already embraced by major players like Anthropic, Meta, Microsoft, and Google.
- However, synthetic data has limitations, such as exaggerating biases and failing to include outliers, which could worsen AI’s tendency to hallucinate or lead to “model collapse” if not supplemented with high-quality real data.
Human-powered data labeling and creation: Some startups are employing large numbers of human workers to clean up, label, and create new data for AI training:
- Scale AI, valued at $14 billion, employs around 200,000 workers through its subsidiary Remotasks to annotate data for top AI startups like OpenAI, Cohere, and Character AI.
- Toloka, based in Amsterdam, has crowdsourced 9 million “AI tutors” to label data, create original content, and work with domain experts to generate specialized data for niche AI models.
- However, managing large-scale human operations and ensuring fair compensation for workers remain challenges for the AI industry.
Focusing on efficiency and specificity over volume: Researchers suggest that advanced AI may not always require massive amounts of data, and the industry is starting to shift towards smaller, task-specific models:
- Nestor Maslej, a researcher at Stanford University, believes that because the human brain learns from far less data than AI models require, there is room to improve AI’s data efficiency.
- Startups like Mistral AI are building smaller, specialized models that require less data, such as Mathstral, an AI designed for math problems.
- Snorkel AI helps companies make the most of their existing data by providing software that enables staff to label data quickly, creating purpose-built models that don’t rely on massive volumes of data.
Broader implications: As the AI industry grapples with the data wall, the solutions being developed by startups and researchers could shape the future of AI development:
- The success of synthetic data and human-powered data creation in addressing the data shortage will likely influence the direction of AI research and the types of models being developed.
- The shift towards smaller, task-specific models may lead to a more diverse AI landscape, with a greater emphasis on efficiency and specialization rather than massive, general-purpose models.
- Overcoming the data wall will be crucial to the AI industry's continued growth, and the approaches these startups are exploring could play a significant role.