Earlier in the year, Meta unveiled Massively Multilingual Speech (MMS): a set of AI research models that can identify more than 4,000 spoken languages and perform text-to-speech and speech-to-text conversion across more than 1,100 languages. At the time, these numbers were unmatched by any other model. This focus on extreme language diversity supports Meta’s mission of bringing people together globally through AI.
Meta’s commitment to connecting the world through AI is exemplified by the introduction of the Seamless Communication family, comprising four groundbreaking NLP models:
SeamlessM4T v2
SeamlessM4T v2 is a multitask, multilingual model trained on 4.5 million hours of speech data, and it provides the foundation for the other models. According to Meta, SeamlessM4T v2 is capable of:
- Speech recognition for nearly 100 languages
- Speech-to-text translation for nearly 100 input and output languages
- Speech-to-speech translation, supporting nearly 100 input languages and 36 (including English) output languages
- Text-to-text translation for nearly 100 languages
- Text-to-speech translation, supporting nearly 100 input languages and 35 (plus English) output languages
According to Meta, SeamlessM4T v2 delivers higher translation quality than existing translation systems, and it serves as the backbone for SeamlessExpressive, SeamlessStreaming, and Seamless.
SeamlessExpressive
This model focuses on preserving the expressive qualities of speech across languages, including tone and emotion, speech rate, pauses, and vocal style. While there are already text-to-speech models capable of replicating general emotions, SeamlessExpressive differentiates itself by adapting to the specific person speaking and carrying that personal style of speech across numerous languages.
SeamlessStreaming
SeamlessStreaming delivers the translations from SeamlessM4T v2 in real time, beginning the translation process before the speaker has even finished their sentence. To accomplish this, it decides when the speaker has given enough context to produce a translation, choosing at each step whether to listen for more input or to begin generating output. In making that choice, SeamlessStreaming accounts for the differences in sentence structure and grammar between languages.
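The listen-or-speak decision described above can be illustrated with a toy simulation. This is only a sketch under stated assumptions: it uses a wait-k schedule, a simple fixed rule that stays k words behind the speaker, instead of a decision learned from the input as the article describes. The names `wait_k_policy` and `simulate` are hypothetical, and the "translation" is a placeholder that upper-cases each word.

```python
from typing import Callable, List, Tuple

# A policy looks at how many source tokens have been read and how many
# target tokens have been written, and returns "READ" or "WRITE".
Policy = Callable[[int, int], str]


def wait_k_policy(k: int) -> Policy:
    """Fixed-schedule policy: stay k source tokens ahead of the output."""
    if k < 1:
        raise ValueError("k must be at least 1")

    def decide(num_read: int, num_written: int) -> str:
        return "WRITE" if num_read - num_written >= k else "READ"

    return decide


def simulate(source: List[str], policy: Policy) -> List[Tuple[str, str]]:
    """Replay a source stream and log each READ/WRITE action.

    The "translation" is a placeholder (upper-casing the aligned source
    token); a real system would call a translation model here instead.
    """
    actions: List[Tuple[str, str]] = []
    num_read = 0
    num_written = 0
    while num_written < len(source):
        source_finished = num_read == len(source)
        # Once the speaker has stopped, the only option left is to write.
        if source_finished or policy(num_read, num_written) == "WRITE":
            actions.append(("WRITE", source[num_written].upper()))
            num_written += 1
        else:
            actions.append(("READ", source[num_read]))
            num_read += 1
    return actions


if __name__ == "__main__":
    for action, token in simulate("the cat sat down".split(), wait_k_policy(2)):
        print(f"{action:5} {token}")
```

With k = 2, output starts after two words have been heard and then alternates with new input. A smaller k lowers latency but gives the system less context for each output word, which is the trade-off a learned policy tries to balance per language pair.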
Seamless
This model combines the multilingual capabilities of SeamlessM4T v2, the personalized expressiveness of SeamlessExpressive, and the low-latency delivery of SeamlessStreaming to create one unified system for use in applications.
Anyone can try these models, judge how well they work, and suggest improvements or report bugs. All of the models are also open-source, so any developer or researcher can build on them and adapt them to their own needs. As a safety measure, Meta embeds an inaudible watermark in each audio translation it produces so that each clip’s authenticity can be verified.
Seamless has many practical real-world applications. In a corporate setting, it can translate conferences live for large companies whose offices span many parts of the world or whose employees come from diverse backgrounds. The same applies to government press conferences: populations with many migrants or great linguistic diversity could benefit from such translation tools. Seamless could also make educational resources more accessible in regions whose native dialects are often overlooked. By offering immediate, automatic translation into multiple languages, it could promote inclusivity and equal access to educational opportunities.
With these models, Meta has opened the door to further progress in machine translation and to potentially revolutionary innovations built on top of it.