Microsoft’s VALL-E 2 Achieves Human-Like Speech, Deemed Too Risky for Release

Microsoft’s VALL-E 2 text-to-speech AI reaches what its researchers call “human parity” in generating convincing human voices, but has been deemed too risky for public release because of its potential for misuse.

Key advancements enable more natural speech generation: VALL-E 2 adds two features that let it generate high-quality, human-like speech from just a few seconds of audio:

  • “Repetition Aware Sampling” makes text-to-speech decoding more stable by accounting for repetitions of small language units called “tokens,” which prevents infinite loops and makes the speech pattern more natural and fluid.
  • “Grouped Code Modeling” improves efficiency by grouping tokens so the model processes a shorter sequence, which speeds up speech generation and eases the difficulty of modeling long sound sequences.
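
To make the two ideas above concrete, here is a minimal Python sketch. It is an illustration, not Microsoft's implementation: the function names, the default values (`top_p`, `window`, `max_ratio`), and the exact fallback rule (redrawing from the full distribution when a token dominates the recent history) are all assumptions chosen for clarity.

```python
import random
from typing import List, Optional, Sequence


def nucleus_sample(probs: Sequence[float], top_p: float, rng: random.Random) -> int:
    """Sample a token id from the smallest high-probability set whose mass reaches top_p."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept: List[int] = []
    mass = 0.0
    for i in ranked:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    return rng.choices(kept, weights=[probs[i] for i in kept], k=1)[0]


def repetition_aware_sample(
    probs: Sequence[float],
    history: Sequence[int],
    top_p: float = 0.8,      # assumed default, for illustration only
    window: int = 10,        # assumed: how many recent tokens to inspect (>= 1)
    max_ratio: float = 0.5,  # assumed: repetition ratio that triggers the fallback
    rng: Optional[random.Random] = None,
) -> int:
    """Pick the next token, falling back to full-distribution sampling on heavy repetition."""
    rng = rng or random.Random()
    token = nucleus_sample(probs, top_p, rng)
    recent = list(history[-window:])
    if recent:
        ratio = recent.count(token) / len(recent)
        if ratio > max_ratio:
            # The candidate already dominates recent output: redraw from the
            # whole distribution so decoding can escape the loop.
            token = rng.choices(range(len(probs)), weights=probs, k=1)[0]
    return token


def group_codes(codes: Sequence[int], group_size: int) -> List[List[int]]:
    """Pack a flat token sequence into consecutive groups (the last group may be shorter)."""
    return [list(codes[i:i + group_size]) for i in range(0, len(codes), group_size)]
```

In this sketch, `group_codes(list(range(6)), 2)` turns six tokens into three groups, so a model that emits one group per step would take half as many steps. The sampling function only departs from ordinary nucleus sampling when the candidate token already makes up more than `max_ratio` of the recent window.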

Benchmarking suggests VALL-E 2 matches or exceeds human speech quality: Microsoft researchers used speech libraries and an evaluation framework called ELLA-V to assess VALL-E 2’s performance:

  • Tests on the LibriSpeech and VCTK datasets showed VALL-E 2 surpassed previous zero-shot text-to-speech systems in speech robustness, naturalness, and speaker similarity.
  • The researchers claim VALL-E 2 is the first of its kind to reach “human parity” on these benchmarks, meaning its generated speech matched or exceeded the quality of actual human speech.

Ethical concerns prevent public release despite potential applications: While VALL-E 2 could theoretically be used for applications like education, entertainment, chatbots, and accessibility features, Microsoft is restricting access due to risks of misuse:

  • The researchers stated they have no plans to incorporate VALL-E 2 into a product or expand public access, as it carries potential risks like voice spoofing or impersonation.
  • This aligns with growing concerns around voice cloning and deepfake technology, with other AI companies like OpenAI placing similar restrictions on their voice tech.

Analyzing deeper: While VALL-E 2 represents an impressive advancement in AI-generated speech, the decision not to release it highlights the complex ethical challenges surrounding increasingly powerful AI systems. As the technology progresses to convincingly recreate elements of human behavior and communication, robust safeguards and oversight will be critical to mitigate risks and ensure responsible development and deployment. Key questions remain around how to effectively prevent malicious applications of voice cloning and deepfakes as the underlying AI continues to advance.

AI speech generator 'reaches human parity' — but it's too dangerous to release, scientists say
