AI Startups Tackle Looming Data Shortage with Innovative Solutions

The AI industry faces a data shortage: companies have already exhausted much of the available training data, and startups are exploring new ways to fill the gap.

Synthetic data emerges as a potential solution: Gretel, a startup valued at $350 million, is creating AI-generated synthetic data that closely mimics real information without the privacy concerns:

Synthetic data has been used by companies working with sensitive information, such as patient data, to protect privacy while still providing valuable training data for AI models.
Gretel’s CEO, Ali Golshan, sees an opportunity to supply data-starved AI companies with fake data made from scratch to train their models, an approach already embraced by major players like Anthropic, Meta, Microsoft, and Google.
However, synthetic data has limitations, such as exaggerating biases and failing to include outliers, which could worsen AI’s tendency to hallucinate or lead to “model collapse” if not supplemented with high-quality real data.
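The outlier and "model collapse" risk can be illustrated with a toy simulation. The sketch below is not Gretel's method or any vendor's pipeline; it assumes a deliberately simple "model" (a normal distribution fitted to the data) and a hypothetical function name, `simulate_recursive_training`. Each generation is trained only on samples produced by the previous generation, with no real data mixed back in, and the spread of the data tends to shrink as rare values stop being reproduced.

```python
import random
import statistics


def simulate_recursive_training(generations=200, sample_size=20, seed=0):
    """Return the spread (standard deviation) of the data at each generation.

    Toy assumption: the "model" is just a normal distribution fitted to the
    current dataset, and each new dataset is sampled entirely from that model.
    """
    rng = random.Random(seed)
    # Generation 0: "real" data drawn from a standard normal distribution.
    data = [rng.gauss(0.0, 1.0) for _ in range(sample_size)]
    spreads = [statistics.pstdev(data)]
    for _ in range(generations):
        # "Train" the toy model: estimate mean and spread from the current data.
        mean = statistics.fmean(data)
        spread = statistics.pstdev(data)
        # Generate the next dataset purely from the model (no real data added).
        data = [rng.gauss(mean, spread) for _ in range(sample_size)]
        spreads.append(statistics.pstdev(data))
    return spreads


if __name__ == "__main__":
    spreads = simulate_recursive_training()
    print(f"spread of real data:        {spreads[0]:.4f}")
    print(f"spread after final round:   {spreads[-1]:.6f}")
```

In this toy setting the estimated spread drifts downward over many rounds, which is one simple way to picture why synthetic data is usually supplemented with high-quality real data rather than used alone.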

Human-powered data labeling and creation: Some startups are employing large numbers of human workers to clean up, label, and create new data for AI training:

Scale AI, valued at $14 billion, employs around 200,000 workers through its subsidiary Remotasks to annotate data for top AI startups like OpenAI, Cohere, and Character AI.
Toloka, based in Amsterdam, has crowdsourced 9 million “AI tutors” to label data, create original content, and work with domain experts to generate specialized data for niche AI models.
However, managing large-scale human operations and ensuring fair compensation for workers remains a challenge in the AI industry.

Focusing on efficiency and specificity over volume: Researchers suggest that advanced AI may not always require massive amounts of data, and the industry is starting to shift towards smaller, task-specific models:

Nestor Maslej, a researcher at Stanford University, believes that the human brain’s efficiency in learning from relatively little data compared to AI models indicates room for improvement in AI’s data efficiency.
Startups like Mistral AI are building smaller, specialized models that require less data, such as Mathstral, an AI designed for math problems.
Snorkel AI helps companies make the most of their existing data by providing software that enables staff to label data quickly, creating purpose-built models that don’t rely on massive volumes of data.

Broader implications: As the AI industry grapples with the "data wall," the point at which fresh training data runs out, the solutions being developed by startups and researchers could shape the future of AI development:

The success of synthetic data and human-powered data creation in addressing the data shortage will likely influence the direction of AI research and the types of models being developed.
The shift towards smaller, task-specific models may lead to a more diverse AI landscape, with a greater emphasis on efficiency and specialization rather than massive, general-purpose models.
Getting past the data wall will be crucial for the industry's continued growth, and the approaches startups are exploring could play a significant role in doing so.

The Internet Isn’t Big Enough To Train AI. One Fix? Fake Data.

Forbes