Supporting thousands of languages
Many of the world’s languages are in danger of disappearing, and the limitations of current speech recognition and generation technology will only accelerate this trend. We want to make it easier for people to access information and use devices in their preferred language, and today we’re announcing a range of artificial intelligence (AI) models that can help them do just that.
Massively Multilingual Speech (MMS) models expand text-to-speech and speech-to-text technology from about 100 languages to more than 1,100 – more than 10 times as many as before – and can also identify more than 4,000 spoken languages, 40 times more than before.
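As a concrete illustration, here is a minimal sketch of transcribing audio with one of the released speech-to-text checkpoints through the Hugging Face transformers integration. The model ID facebook/mms-1b-all, the per-language adapter API, the input file sample.wav, and the 16 kHz mono input are assumptions based on that integration, not details stated in this post.

```python
# Minimal sketch: speech-to-text with an MMS checkpoint via Hugging Face
# transformers. Model ID, adapter API, and input file are assumptions.
import torch
import torchaudio
from transformers import AutoProcessor, Wav2Vec2ForCTC

model_id = "facebook/mms-1b-all"  # assumed released MMS ASR checkpoint
processor = AutoProcessor.from_pretrained(model_id)
model = Wav2Vec2ForCTC.from_pretrained(model_id)

# MMS uses small per-language adapters; switch both the tokenizer and the
# adapter weights to the target language via its ISO 639-3 code (here French).
processor.tokenizer.set_target_lang("fra")
model.load_adapter("fra")

# Load a hypothetical recording and convert it to 16 kHz mono.
waveform, sr = torchaudio.load("sample.wav")
waveform = torchaudio.functional.resample(waveform, sr, 16_000).mean(dim=0)

inputs = processor(waveform.numpy(), sampling_rate=16_000, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

predicted_ids = torch.argmax(logits, dim=-1)[0]
print(processor.decode(predicted_ids))
```

Swapping languages only requires loading a different adapter, which is what makes covering more than 1,100 languages with a single base model practical.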
There are also many use cases for speech technology – from virtual and augmented reality to messaging services – that can be used in a person’s preferred language and can understand anyone’s voice.
We are open-sourcing our models and code so that others in the research community can build on our work, help preserve the world’s languages, and bring the world closer together.
Our approach
Collecting audio data for thousands of languages was our first challenge because the largest existing speech datasets cover no more than 100 languages. To overcome this, we turned to religious texts, such as the Bible, which have been translated into many different languages and whose translations have been widely studied for text-based language translation research.
These translations have publicly available audio recordings of people reading the texts in different languages. As part of the MMS project, we created a dataset of New Testament readings in more than 1,100 languages, yielding an average of 32 hours of data per language.
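For a sense of the bookkeeping behind a figure like the 32-hour average, here is a small hypothetical sketch that totals audio hours per language from a dataset manifest. The manifest format (one JSON object per line with "lang" and "duration_sec" fields) is invented for illustration.

```python
# Hypothetical sketch: average audio hours per language from a manifest.
# The manifest format and file name are invented for illustration only.
import json
from collections import defaultdict

seconds_per_lang = defaultdict(float)
with open("manifest.jsonl") as f:  # hypothetical manifest file
    for line in f:
        record = json.loads(line)
        seconds_per_lang[record["lang"]] += record["duration_sec"]

hours_per_lang = {lang: s / 3600 for lang, s in seconds_per_lang.items()}
avg_hours = sum(hours_per_lang.values()) / len(hours_per_lang)
print(f"{len(hours_per_lang)} languages, {avg_hours:.1f} hours per language on average")
```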
By considering unlabeled recordings of various other Christian religious readings, we increased the number of available languages to more than 4,000. Although this data is from a specific domain and is often read by male speakers, our analysis shows that our models perform equally well for male and female voices. And although the content of the audio recordings is religious, our analysis shows that this does not bias the models to produce more religious language.
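The kind of check described above can be framed as a simple subgroup comparison: compute word error rate (WER) separately for male and female speakers and compare the two. The sketch below uses the jiwer library for WER; the example records are invented for illustration and do not come from this post.

```python
# Hypothetical sketch of a subgroup analysis: compare WER for male vs.
# female speakers. The example data is invented; jiwer computes WER.
from jiwer import wer

# Each entry: (speaker gender, reference transcript, model hypothesis).
results = [
    ("male", "in the beginning was the word", "in the beginning was the word"),
    ("female", "and the word was with god", "and the word was with god"),
    ("female", "all things were made through him", "all things were made for him"),
]

for gender in ("male", "female"):
    refs = [r for g, r, _ in results if g == gender]
    hyps = [h for g, _, h in results if g == gender]
    print(f"{gender}: WER = {wer(refs, hyps):.2%}")
```

A large gap between the two numbers on a held-out set would indicate the kind of bias the analysis is designed to rule out.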
Looking ahead
In the future, we will increase the coverage of MMS to support even more languages, and also tackle the challenge of handling dialects, which are often difficult for existing speech technology.
Learn more about MMS.