Meta Platforms, Inc.’s AI research division has announced Voicebox, a new generative AI model for speech. Meta describes it as the first model to generalize to speech-generation tasks it wasn’t specifically trained for, all while delivering state-of-the-art performance. It’s hard not to be swept up in the excitement, but a healthy dose of scientific skepticism is in order.
Voicebox sets itself apart from previous models by learning from raw audio and its corresponding transcription, which lets it modify any part of a given sample rather than only the end of an audio clip. This departs from how earlier speech generators worked, and while the potential applications are intriguing, the scientific community will want to scrutinize the approach independently.
Voicebox is built on a method called Flow Matching. According to Meta, it outperforms VALL-E, the current state-of-the-art English model, in zero-shot text-to-speech on both intelligibility and audio similarity, while being up to 20 times faster.
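For readers unfamiliar with Flow Matching, the general idea is to train a model to predict a velocity field that carries random noise toward real data, then generate by integrating that field over time. The sketch below illustrates the generic technique on toy vectors and is not Voicebox's implementation, which has not been released. The function names, the `SIGMA_MIN` value, and the use of plain Python lists in place of audio features are all assumptions made for illustration.

```python
import random

SIGMA_MIN = 1e-4  # assumed small constant, not taken from the article


def interpolate(x0, x1, t, sigma_min=SIGMA_MIN):
    """Point at time t on a straight-line path from noise x0 (t=0) toward data x1 (t=1)."""
    scale = 1.0 - (1.0 - sigma_min) * t
    return [scale * a + t * b for a, b in zip(x0, x1)]


def target_velocity(x0, x1, sigma_min=SIGMA_MIN):
    """Time derivative of the path above; the model is trained to regress onto it."""
    return [b - (1.0 - sigma_min) * a for a, b in zip(x0, x1)]


def flow_matching_loss(predict_velocity, x1, rng):
    """One-sample estimate of the loss: squared error between predicted and target velocity."""
    t = rng.random()                           # random time in [0, 1)
    x0 = [rng.gauss(0.0, 1.0) for _ in x1]     # random noise sample
    xt = interpolate(x0, x1, t)
    u = target_velocity(x0, x1)
    v = predict_velocity(xt, t)                # hypothetical model call
    return sum((vi - ui) ** 2 for vi, ui in zip(v, u)) / len(x1)


def euler_sample(predict_velocity, x0, steps=10):
    """Generate by integrating dx/dt = v(x, t) from t=0 to t=1 with fixed Euler steps."""
    x = list(x0)
    dt = 1.0 / steps
    for i in range(steps):
        v = predict_velocity(x, i * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x
```

In a real system `predict_velocity` would be a neural network that also receives conditioning such as the transcript and surrounding audio; here it is just any callable taking the current point and the time.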
Voicebox’s capabilities are extensive: it can synthesize speech across six languages and perform noise removal, content editing, style conversion, and diverse sample generation. It was trained on over 50,000 hours of recorded speech and transcripts from public domain audiobooks in English, French, Spanish, German, Polish, and Portuguese.
Meta AI acknowledges the potential for misuse and unintended harm, and says it has built a highly effective classifier to distinguish between authentic speech and audio generated with Voicebox. The model and code are not publicly available at this time, but the team is sharing audio samples and a research paper detailing their approach and results.
If Meta's results hold up, Voicebox represents a significant step forward in generative AI research and could usher in a new era of generative AI for speech. However, as with all claimed scientific breakthroughs, the results need independent verification. The AI research community will be watching closely, ready to celebrate if Voicebox lives up to its promise.
Meta has made the Voicebox research paper (titled “Voicebox: Text-Guided Multilingual Universal Speech Generation at Scale”) available for those who want further details about this project.