
Meta announces Voicebox, a generative model for multiple voice synthesis tasks


One of the interesting applications of Voicebox is voice sampling. The model can generate various speech samples from a single text sequence.
Meta Platforms’ artificial intelligence research arm introduced Voicebox, a machine learning model that can generate speech from text. What sets Voicebox apart from other text-to-speech models is its ability to perform many tasks that it has not been trained for, including editing, noise removal, and style transfer.
The model was trained using a special method developed by Meta researchers. While Meta has not released Voicebox due to ethical concerns about misuse, the initial results are promising and could power many applications in the future.

‘Flow Matching’
Voicebox is a generative model that can synthesize speech in six languages: English, French, Spanish, German, Polish, and Portuguese. Like large language models, it has been trained on a very general task that can be used for many applications. But while LLMs try to learn the statistical regularities of words and text sequences, Voicebox has been trained to learn the patterns that map voice audio samples to their transcripts.
Such a model can then be applied to many downstream tasks with little or no fine-tuning. “The goal is to build a single model that can perform many text-guided speech generation tasks through in-context learning,” Meta’s researchers write in their paper (PDF) describing the technical details of Voicebox.
The model was trained with Meta’s “Flow Matching” technique, which is more efficient and generalizable than the diffusion-based learning methods used in other generative models. The technique enables Voicebox to “learn from varied speech data without those variations having to be carefully labeled.” Without the need for manual labeling, the researchers were able to train Voicebox on 50,000 hours of speech and transcripts from audiobooks.
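At a high level, flow matching trains a network to predict the velocity of a path that carries noise samples to data samples, rather than learning a denoising process step by step as diffusion models do. The toy sketch below shows the standard conditional flow-matching loss on a linear noise-to-data path; the shapes, the placeholder model, and the variable names are illustrative assumptions, not Voicebox's actual architecture or feature space.

```python
import numpy as np

rng = np.random.default_rng(0)

def cfm_loss(x1, model, rng):
    """Conditional flow-matching loss for one batch (toy sketch).

    x1: batch of data samples (in Voicebox, speech features).
    model: callable v(x_t, t) predicting a velocity field.
    """
    x0 = rng.standard_normal(x1.shape)      # noise samples
    t = rng.uniform(size=(x1.shape[0], 1))  # random times in [0, 1]
    xt = (1.0 - t) * x0 + t * x1            # linear path from noise to data
    target = x1 - x0                        # velocity of that path
    pred = model(xt, t)
    return np.mean((pred - target) ** 2)    # regress predicted onto true velocity

# placeholder "model" that always predicts zero velocity, just to run the loss
loss = cfm_loss(rng.standard_normal((8, 4)), lambda x, t: np.zeros_like(x), rng)
```

In practice the loss would be minimized with gradient descent over a neural vector field; sampling then integrates that field from noise to data.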
The model uses “text-guided speech infilling” as its training goal, which means it must predict a segment of speech given its surrounding audio and the complete text transcript.
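The infilling setup can be illustrated with a simple masking routine: hide a contiguous span of audio frames and keep the surrounding context (plus, in Voicebox, the full transcript) as the model's input, with the hidden span as the prediction target. The frame dimensions, mask fraction, and function below are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(1)

def make_infilling_example(frames, mask_frac=0.3, rng=rng):
    """Sketch of a speech-infilling training example (assumed shapes).

    frames: (num_frames, feature_dim) array of audio features.
    Returns the masked context, the boolean mask, and the hidden target span.
    """
    n = frames.shape[0]
    span = max(1, int(n * mask_frac))            # length of the hidden span
    start = rng.integers(0, n - span + 1)        # random span position
    mask = np.zeros(n, dtype=bool)
    mask[start:start + span] = True
    context = frames.copy()
    context[mask] = 0.0                          # hide the masked frames
    target = frames[mask]                        # what the model must predict
    return context, mask, target

frames = rng.standard_normal((100, 80))          # 100 frames of 80-dim features
context, mask, target = make_infilling_example(frames)
```

A model trained on many such examples learns to fill arbitrary gaps, which is why the same objective supports editing and noise removal at inference time.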
