
The AI industry is facing a looming data shortage as companies have already exhausted much of the available training data, but startups are exploring innovative solutions to address this challenge.

Synthetic data emerges as a potential solution: Gretel, a startup valued at $350 million, is creating AI-generated synthetic data that closely mimics real information without the privacy concerns:

  • Synthetic data has been used by companies working with sensitive information, such as patient data, to protect privacy while still providing valuable training data for AI models.
  • Gretel’s CEO, Ali Golshan, sees an opportunity to supply data-starved AI companies with fake data made from scratch to train their models, an approach already embraced by major players like Anthropic, Meta, Microsoft, and Google.
  • However, synthetic data has limitations, such as exaggerating biases and failing to include outliers, which could worsen AI’s tendency to hallucinate or lead to “model collapse” if not supplemented with high-quality real data.

Human-powered data labeling and creation: Some startups are employing large numbers of human workers to clean up, label, and create new data for AI training:

  • Scale AI, valued at $14 billion, employs around 200,000 workers through its subsidiary Remotasks to annotate data for top AI startups like OpenAI, Cohere, and Character AI.
  • Toloka, based in Amsterdam, has crowdsourced 9 million “AI tutors” to label data, create original content, and work with domain experts to generate specialized data for niche AI models.
  • However, managing large-scale human operations and ensuring fair compensation for workers remain challenges for the AI industry.

Focusing on efficiency and specificity over volume: Researchers suggest that advanced AI may not always require massive amounts of data, and the industry is starting to shift towards smaller, task-specific models:

  • Nestor Maslej, a researcher at Stanford University, believes that because the human brain learns from far less data than AI models do, there is room to improve AI’s data efficiency.
  • Startups like Mistral AI are building smaller, specialized models that require less data, such as Mathstral, an AI designed for math problems.
  • Snorkel AI helps companies make the most of their existing data by providing software that enables staff to label data quickly, creating purpose-built models that don’t rely on massive volumes of data.

Broader implications: As the AI industry grapples with the data wall, the solutions being developed by startups and researchers could shape the future of AI development:

  • The success of synthetic data and human-powered data creation in addressing the data shortage will likely influence the direction of AI research and the types of models being developed.
  • The shift towards smaller, task-specific models may lead to a more diverse AI landscape, with a greater emphasis on efficiency and specialization rather than massive, general-purpose models.
  • Overcoming the data wall will be crucial to the AI industry’s continued growth, and the approaches startups are exploring could play a significant role in doing so.
The Internet Isn’t Big Enough To Train AI. One Fix? Fake Data.
