×
AI Startups Tackle Looming Data Shortage with Innovative Solutions
Written by
Published on
Join our daily newsletter for breaking news, product launches and deals, research breakdowns, and other industry-leading AI coverage
Join Now

The AI industry is facing a looming data shortage as companies have already exhausted much of the available training data, but startups are exploring innovative solutions to address this challenge.

Synthetic data emerges as a potential solution: Gretel, a startup valued at $350 million, is creating AI-generated synthetic data that closely mimics real information without the privacy concerns:

  • Synthetic data has been used by companies working with sensitive information, such as patient data, to protect privacy while still providing valuable training data for AI models.
  • Gretel’s CEO, Ali Golshan, sees an opportunity to supply data-starved AI companies with fake data made from scratch to train their models, an approach already embraced by major players like Anthropic, Meta, Microsoft, and Google.
  • However, synthetic data has limitations, such as exaggerating biases and failing to include outliers, which could worsen AI’s tendency to hallucinate or lead to “model collapse” if not supplemented with high-quality real data.

Human-powered data labeling and creation: Some startups are employing large numbers of human workers to clean up, label, and create new data for AI training:

  • Scale AI, valued at $14 billion, employs around 200,000 workers through its subsidiary Remotasks to annotate data for top AI startups like OpenAI, Cohere, and Character AI.
  • Toloka, based in Amsterdam, has crowdsourced 9 million “AI tutors” to label data, create original content, and work with domain experts to generate specialized data for niche AI models.
  • However, managing large-scale human operations and ensuring fair compensation for workers remains a challenge in the AI industry.

Focusing on efficiency and specificity over volume: Researchers suggest that advanced AI may not always require massive amounts of data, and the industry is starting to shift towards smaller, task-specific models:

  • Nestor Maslej, a researcher at Stanford University, believes that the human brain’s efficiency in learning from relatively little data compared to AI models indicates room for improvement in AI’s data efficiency.
  • Startups like Mistral AI are building smaller, specialized models that require less data, such as Mathstral, an AI designed for math problems.
  • Snorkel AI helps companies make the most of their existing data by providing software that enables staff to label data quickly, creating purpose-built models that don’t rely on massive volumes of data.

Broader implications: As the AI industry grapples with the data wall, the solutions being developed by startups and researchers could shape the future of AI development:

  • The success of synthetic data and human-powered data creation in addressing the data shortage will likely influence the direction of AI research and the types of models being developed.
  • The shift towards smaller, task-specific models may lead to a more diverse AI landscape, with a greater emphasis on efficiency and specialization rather than massive, general-purpose models.
  • Addressing the data wall will be crucial for the continued growth and advancement of the AI industry, and the innovative approaches being explored by startups could play a significant role in overcoming this challenge.
The Internet Isn’t Big Enough To Train AI. One Fix? Fake Data.

Recent News

Baidu reports steepest revenue drop in 2 years amid slowdown

China's tech giant Baidu saw revenue drop 3% despite major AI investments, signaling broader challenges for the nation's technology sector amid economic headwinds.

How to manage risk in the age of AI

A conversation with Palo Alto Networks CEO about his approach to innovation as new technologies and risks emerge.

How to balance bold, responsible and successful AI deployment

Major companies are establishing AI governance structures and training programs while racing to deploy generative AI for competitive advantage.