The Internet Isn’t Big Enough To Train AI. One Fix? Fake Data.
The AI industry is facing a looming data shortage: companies have already exhausted much of the readily available training data, but startups are exploring several ways around the shortfall.
Synthetic data as a potential solution: Gretel, a startup valued at $350 million, creates AI-generated synthetic data that closely mimics real information without the privacy risks of using the real thing:
- Synthetic data has been used by companies working with sensitive information, such as patient data, to protect privacy while still providing valuable training data for AI models.
- Gretel’s CEO, Ali Golshan, sees an opportunity to supply data-starved AI companies with fake data made from scratch to train their models, an approach already embraced by major players like Anthropic, Meta, Microsoft, and Google.
- However, synthetic data has limitations: it can exaggerate biases and miss outliers, which could worsen AI’s tendency to hallucinate or lead to “model collapse” if models are not also trained on high-quality real data (the toy sketch after this list shows how naively sampled synthetic data can lose rare outliers).
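To make the idea concrete, here is a toy sketch of synthetic tabular data generation: fit simple per-column distributions to a small invented “real” dataset and sample new rows from them. This is a generic illustration only, not Gretel’s method; the records, column names, and modeling choices are all assumptions, and the sketch also shows why naive synthetic data can underrepresent rare outliers.

```python
# Toy sketch of synthetic tabular data generation: fit simple per-column
# statistics to a small "real" dataset and sample new rows from them.
# Illustration only; NOT Gretel's method. Production systems use far richer
# generative models plus privacy guarantees. All records here are invented.
import random
import statistics

# Hypothetical patient-style records (invented for illustration).
real_rows = [
    {"age": 34, "condition": "asthma"},
    {"age": 41, "condition": "asthma"},
    {"age": 29, "condition": "diabetes"},
    {"age": 55, "condition": "diabetes"},
    {"age": 91, "condition": "rare_disorder"},  # a rare outlier
]

# "Fit": mean/stdev for the numeric column, observed values for the categorical one.
ages = [row["age"] for row in real_rows]
age_mu, age_sigma = statistics.mean(ages), statistics.stdev(ages)
conditions = [row["condition"] for row in real_rows]

def sample_synthetic_row() -> dict:
    """Draw one synthetic record from the fitted per-column distributions."""
    return {
        "age": round(random.gauss(age_mu, age_sigma)),
        # Sampling by observed frequency: rare values such as "rare_disorder"
        # show up rarely or never, which is one way synthetic data loses outliers.
        "condition": random.choice(conditions),
    }

synthetic_rows = [sample_synthetic_row() for _ in range(10)]
print(synthetic_rows)
```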
Human-powered data labeling and creation: Some startups employ large numbers of human workers to clean up, label, and create new data for AI training (a minimal label-aggregation sketch follows this list):
- Scale AI, valued at $14 billion, employs around 200,000 workers through its subsidiary Remotasks to annotate data for top AI startups like OpenAI, Cohere, and Character AI.
- Toloka, based in Amsterdam, has crowdsourced 9 million “AI tutors” to label data, create original content, and work with domain experts to generate specialized data for niche AI models.
- However, managing human operations at this scale and ensuring fair compensation for workers remain persistent challenges in the AI industry.
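As a rough illustration of one step in human-powered labeling, the sketch below merges several workers’ labels for the same item with a simple majority vote and flags low-agreement items. It is a generic, assumption-laden example, not how Scale AI, Remotasks, or Toloka actually run their pipelines; the item IDs and labels are invented.

```python
# Generic sketch of aggregating crowdsourced annotations: several workers
# label each item, and a majority vote resolves disagreements. Not the
# actual pipeline of any company mentioned above; all data is invented.
from collections import Counter

# Hypothetical annotations: item ID -> labels from three different workers.
annotations = {
    "img_001": ["cat", "cat", "dog"],
    "img_002": ["dog", "dog", "dog"],
    "img_003": ["cat", "bird", "cat"],
}

def majority_vote(labels: list[str]) -> tuple[str, float]:
    """Return the most common label and the fraction of workers who chose it."""
    label, count = Counter(labels).most_common(1)[0]
    return label, count / len(labels)

for item_id, labels in annotations.items():
    label, agreement = majority_vote(labels)
    # Low agreement can flag an item for escalation to an expert annotator.
    print(f"{item_id}: {label} (agreement={agreement:.2f})")
```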
Focusing on efficiency and specificity over volume: Researchers suggest that advanced AI may not always require massive amounts of data, and the industry is starting to shift towards smaller, task-specific models:
- Nestor Maslej, a researcher at Stanford University, notes that the human brain learns effectively from far less data than today’s AI models consume, which suggests substantial room to improve AI’s data efficiency.
- Startups like Mistral AI are building smaller, specialized models that require less data, such as Mathstral, an AI designed for math problems.
- Snorkel AI helps companies get more out of the data they already have, providing software that lets staff label data quickly and build purpose-built models that don’t rely on massive training volumes (a rough sketch of this programmatic-labeling approach follows this list).
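The sketch below illustrates the programmatic-labeling idea in plain Python, not Snorkel AI’s actual API: domain experts write small heuristic “labeling functions,” and their noisy votes are combined into training labels for a purpose-built model. The heuristics, function names, and example messages are all hypothetical.

```python
# Generic sketch of programmatic (weak) labeling: heuristic labeling
# functions vote on each example, and the votes are combined by majority.
# Not Snorkel AI's actual API; all heuristics and examples are invented.
from collections import Counter

ABSTAIN = None
SPAM, NOT_SPAM = "spam", "not_spam"

def lf_contains_free(text: str):
    """Heuristic: messages mentioning 'free' are often spam."""
    return SPAM if "free" in text.lower() else ABSTAIN

def lf_has_greeting(text: str):
    """Heuristic: messages opening with a greeting are often legitimate."""
    return NOT_SPAM if text.lower().startswith(("hi", "hello")) else ABSTAIN

def lf_many_exclamations(text: str):
    """Heuristic: heavy exclamation use suggests spam."""
    return SPAM if text.count("!") >= 3 else ABSTAIN

LABELING_FUNCTIONS = [lf_contains_free, lf_has_greeting, lf_many_exclamations]

def weak_label(text: str):
    """Combine labeling-function votes with a simple majority, ignoring abstains."""
    votes = [lf(text) for lf in LABELING_FUNCTIONS]
    votes = [v for v in votes if v is not ABSTAIN]
    return Counter(votes).most_common(1)[0][0] if votes else ABSTAIN

unlabeled = [
    "Hello team, notes from today's meeting are attached.",
    "FREE prize!!! Claim it now!!!",
]
print([(text, weak_label(text)) for text in unlabeled])
```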
Broader implications: As the AI industry grapples with the data wall, the solutions being developed by startups and researchers could shape the future of AI development:
- The success of synthetic data and human-powered data creation in addressing the data shortage will likely influence the direction of AI research and the types of models being developed.
- The shift towards smaller, task-specific models may lead to a more diverse AI landscape, with a greater emphasis on efficiency and specialization rather than massive, general-purpose models.
- Overcoming the data wall will be crucial to the AI industry’s continued growth, and the approaches these startups are exploring could play a significant role in getting past it.