×
AI Startups Tackle Looming Data Shortage with Innovative Solutions
Written by
Published on
Join our daily newsletter for breaking news, product launches and deals, research breakdowns, and other industry-leading AI coverage
Join Now

The AI industry is facing a looming data shortage as companies have already exhausted much of the available training data, but startups are exploring innovative solutions to address this challenge.

Synthetic data emerges as a potential solution: Gretel, a startup valued at $350 million, is creating AI-generated synthetic data that closely mimics real information without the privacy concerns:

  • Synthetic data has been used by companies working with sensitive information, such as patient data, to protect privacy while still providing valuable training data for AI models.
  • Gretel’s CEO, Ali Golshan, sees an opportunity to supply data-starved AI companies with fake data made from scratch to train their models, an approach already embraced by major players like Anthropic, Meta, Microsoft, and Google.
  • However, synthetic data has limitations, such as exaggerating biases and failing to include outliers, which could worsen AI’s tendency to hallucinate or lead to “model collapse” if not supplemented with high-quality real data.

Human-powered data labeling and creation: Some startups are employing large numbers of human workers to clean up, label, and create new data for AI training:

  • Scale AI, valued at $14 billion, employs around 200,000 workers through its subsidiary Remotasks to annotate data for top AI startups like OpenAI, Cohere, and Character AI.
  • Toloka, based in Amsterdam, has crowdsourced 9 million “AI tutors” to label data, create original content, and work with domain experts to generate specialized data for niche AI models.
  • However, managing large-scale human operations and ensuring fair compensation for workers remains a challenge in the AI industry.

Focusing on efficiency and specificity over volume: Researchers suggest that advanced AI may not always require massive amounts of data, and the industry is starting to shift towards smaller, task-specific models:

  • Nestor Maslej, a researcher at Stanford University, believes that the human brain’s efficiency in learning from relatively little data compared to AI models indicates room for improvement in AI’s data efficiency.
  • Startups like Mistral AI are building smaller, specialized models that require less data, such as Mathstral, an AI designed for math problems.
  • Snorkel AI helps companies make the most of their existing data by providing software that enables staff to label data quickly, creating purpose-built models that don’t rely on massive volumes of data.

Broader implications: As the AI industry grapples with the data wall, the solutions being developed by startups and researchers could shape the future of AI development:

  • The success of synthetic data and human-powered data creation in addressing the data shortage will likely influence the direction of AI research and the types of models being developed.
  • The shift towards smaller, task-specific models may lead to a more diverse AI landscape, with a greater emphasis on efficiency and specialization rather than massive, general-purpose models.
  • Addressing the data wall will be crucial for the continued growth and advancement of the AI industry, and the innovative approaches being explored by startups could play a significant role in overcoming this challenge.
The Internet Isn’t Big Enough To Train AI. One Fix? Fake Data.

Recent News

Grok stands alone as X restricts AI training on posts in new policy update

X explicitly bans third-party AI companies from using tweets for model training while still preserving access for its own Grok AI.

Coming out of the dark: Shadow AI usage surges in enterprise IT

IT leaders report 90% concern over unauthorized AI tools, with most organizations already suffering negative consequences including data leaks and financial losses.

Anthropic CEO opposes 10-year AI regulation ban in NYT op-ed

As AI capabilities rapidly accelerate, Anthropic's chief executive argues for targeted federal transparency standards rather than blocking state-level regulation for a decade.