
The AI industry faces a looming data shortage because companies have already exhausted much of the available training data. Startups are exploring new ways to address the problem.

Synthetic data emerges as a potential solution: Gretel, a startup valued at $350 million, is creating AI-generated synthetic data that closely mimics real information without the privacy concerns:

  • Synthetic data has been used by companies working with sensitive information, such as patient data, to protect privacy while still providing valuable training data for AI models.
  • Gretel’s CEO, Ali Golshan, sees an opportunity to supply data-starved AI companies with fake data made from scratch to train their models, an approach already embraced by major players like Anthropic, Meta, Microsoft, and Google.
  • However, synthetic data has limitations, such as exaggerating biases and failing to include outliers, which could worsen AI’s tendency to hallucinate or lead to “model collapse” if not supplemented with high-quality real data.

Human-powered data labeling and creation: Some startups are employing large numbers of human workers to clean up, label, and create new data for AI training:

  • Scale AI, valued at $14 billion, employs around 200,000 workers through its subsidiary Remotasks to annotate data for top AI startups like OpenAI, Cohere, and Character AI.
  • Toloka, based in Amsterdam, has crowdsourced 9 million “AI tutors” to label data, create original content, and work with domain experts to generate specialized data for niche AI models.
  • However, managing large-scale human operations and ensuring fair compensation for workers remains a challenge in the AI industry.

Focusing on efficiency and specificity over volume: Researchers suggest that advanced AI may not always require massive amounts of data, and the industry is starting to shift towards smaller, task-specific models:

  • Nestor Maslej, a researcher at Stanford University, believes that because the human brain learns from relatively little data compared with AI models, AI's data efficiency has room to improve.
  • Startups like Mistral AI are building smaller, specialized models that require less data, such as Mathstral, an AI designed for math problems.
  • Snorkel AI helps companies make the most of their existing data by providing software that enables staff to label data quickly, creating purpose-built models that don’t rely on massive volumes of data.

Broader implications: As the AI industry grapples with the data wall, the solutions being developed by startups and researchers could shape the future of AI development:

  • The success of synthetic data and human-powered data creation in addressing the data shortage will likely influence the direction of AI research and the types of models being developed.
  • The shift towards smaller, task-specific models may lead to a more diverse AI landscape, with a greater emphasis on efficiency and specialization rather than massive, general-purpose models.
  • Overcoming the data wall will be crucial to the AI industry's continued growth, and the approaches startups are exploring could play a significant role.
The Internet Isn’t Big Enough To Train AI. One Fix? Fake Data.
