Training an AI model? Mostly AI will take your real data and privacy-proof it

AI privacy innovation: Mostly AI has launched synthetic text functionality that generates privacy-protected data for training enterprise AI models, addressing concerns about using real customer data that contains personally identifiable information.

  • The new tool automates the process of creating synthetic data while preserving the patterns of original datasets, allowing businesses to leverage customer insights without risking privacy.
  • Synthetic data can also be used to rebalance datasets, remove bias, and generate mock data for software testing.

How the technology works: Mostly AI’s platform allows companies to upload proprietary datasets and fine-tune generators to create privacy-protected, synthesized versions of their data.

  • Users can upload data from local devices or external sources and select from various language models, including pre-trained options from HuggingFace.
  • The resulting synthetic data preserves original statistical patterns while complying with privacy protection regulations such as GDPR and CCPA.
  • Mostly AI claims its synthetic text can deliver performance improvements of up to 35% compared to text generated by prompting GPT-4o-mini with few or no real-world examples.
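
A toy illustration of the general idea: the sketch below is not Mostly AI's method or API, only a minimal example of what "preserving statistical patterns" can mean. It assumes a hypothetical table with `name`, `age`, and `plan` columns, learns simple per-column statistics, and samples new rows from them. The function names `fit_toy_generator` and `sample_synthetic` are invented for this example. Unlike a real generator, it treats columns independently, so correlations are lost, and it offers no formal privacy guarantee.

```python
import random
import statistics
from collections import Counter


def fit_toy_generator(rows):
    """Learn per-column statistics from real rows (hypothetical toy model).

    The identifying "name" column is deliberately ignored, so it can
    never appear in the generated output.
    """
    ages = [r["age"] for r in rows]
    plans = Counter(r["plan"] for r in rows)
    return {
        "age_mean": statistics.mean(ages),
        "age_stdev": statistics.stdev(ages),
        "plan_values": list(plans.keys()),
        "plan_weights": list(plans.values()),
    }


def sample_synthetic(model, n, seed=0):
    """Draw n synthetic rows from the learned statistics."""
    rng = random.Random(seed)  # seeded for reproducibility
    return [
        {
            # Age is drawn from a normal distribution fitted to the real ages.
            "age": round(rng.gauss(model["age_mean"], model["age_stdev"])),
            # Plan is drawn in proportion to how often each value was observed.
            "plan": rng.choices(
                model["plan_values"], weights=model["plan_weights"]
            )[0],
        }
        for _ in range(n)
    ]


# Made-up "real" records, for illustration only.
real_rows = [
    {"name": "Alice", "age": 34, "plan": "basic"},
    {"name": "Bob", "age": 45, "plan": "premium"},
    {"name": "Carol", "age": 29, "plan": "basic"},
    {"name": "Dan", "age": 52, "plan": "basic"},
    {"name": "Eve", "age": 41, "plan": "premium"},
    {"name": "Frank", "age": 38, "plan": "basic"},
]

model = fit_toy_generator(real_rows)
synthetic_rows = sample_synthetic(model, n=5000, seed=1)

synthetic_mean_age = statistics.mean(r["age"] for r in synthetic_rows)
synthetic_basic_share = (
    sum(r["plan"] == "basic" for r in synthetic_rows) / len(synthetic_rows)
)
print(f"real mean age: {model['age_mean']:.2f}, synthetic: {synthetic_mean_age:.2f}")
print(f"real basic share: {4 / 6:.2f}, synthetic: {synthetic_basic_share:.2f}")
```

With a large enough sample, the synthetic averages and category shares should land close to the real ones even though no original record is copied. Production systems fine-tune far richer generators, such as the language models mentioned above, to capture relationships this sketch ignores.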

Industry context and potential impact: The launch of Mostly AI’s synthetic text functionality comes as more companies invest in generative AI for specific use cases and products, increasing the importance of proprietary data for training large language models.

  • Unlike public models such as ChatGPT, which are trained on vast amounts of scraped internet data, enterprise AI often requires specialized training on a business’s own customer data.
  • Synthetic data offers a solution to the privacy risks associated with using real customer information containing personally identifiable information.
  • The technology also has the potential to address the growing concern that AI models are exhausting public data sources and yielding diminishing returns.

Challenges and considerations: While synthetic data shows promise in addressing privacy concerns and expanding AI training capabilities, its implementation is not without challenges.

  • A Gartner report from April noted that synthetic data has unrealized potential in software engineering but must be deployed carefully.
  • Creating synthetic data can be resource-intensive, requiring specific testing stages for each use case.
  • There are concerns about model collapse, the idea that models may deteriorate after ingesting too much synthetic data. However, Mostly AI claims to avoid this issue by generating synthetic data once and applying it directly to downstream tasks.

Industry perspectives: The launch of Mostly AI’s synthetic text functionality has sparked discussions about the future of AI training and data privacy.

  • Mostly AI CEO Tobias Hann argues that leveraging both structured and unstructured synthetic data is crucial for safely training and deploying future generative AI solutions.
  • The company positions its technology as a solution to the perceived plateau in AI training due to the exhaustion of public data sources.
  • Even major players like Meta have used a combination of human and synthetic data to train advanced models like Llama 3.1 405B.

Looking ahead: As the AI industry continues to grapple with data privacy concerns and the need for high-quality training data, synthetic data generation may play an increasingly important role.

  • The effectiveness and widespread adoption of synthetic data in AI training remain to be seen, particularly in light of concerns about potential model collapse on a larger scale.
  • As more enterprises explore the use of synthetic data, its impact on AI development, privacy protection, and model performance will likely become clearer in the coming years.
Can synthetic data solve AI's privacy concerns? This company is betting on it
