This year, Meta released Seamless Communication, a collection of AI models designed to handle many kinds of translation, between speech and text in both directions, across nearly 100 languages. Meta emphasized the importance of transparency throughout the development process, so it has open-sourced all of the models and published demos that anyone can try. After testing Meta’s open demos myself, here is my review:
Note: I focused my experimentation on English, French, and Thai, the languages I’m familiar with. Other languages may have specific nuances or issues I didn’t find.
SeamlessM4T v2
This is the main brain behind the translation: the foundation model that the other models build on. It is the successor to SeamlessM4T, the first version of the model. After trying both versions, I immediately noticed an improvement in translation speed. I also noticed that the original version would sometimes interrupt the translated audio with a loud buzzing noise, an issue that has been resolved in v2.
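Because the models are open-source, the web demo isn’t the only way to try them. As a minimal sketch, assuming the Hugging Face transformers integration and the facebook/seamless-m4t-v2-large checkpoint, loading SeamlessM4T v2 and running a simple text-to-text translation looks roughly like this:

```python
# Sketch: text-to-text translation with SeamlessM4T v2 via Hugging Face transformers.
# Assumes `transformers` and `torch` are installed and the
# "facebook/seamless-m4t-v2-large" checkpoint can be downloaded.
from transformers import AutoProcessor, SeamlessM4Tv2ForTextToText

model_id = "facebook/seamless-m4t-v2-large"
processor = AutoProcessor.from_pretrained(model_id)
model = SeamlessM4Tv2ForTextToText.from_pretrained(model_id)

# Translate an English sentence into Thai (three-letter codes: "eng" -> "tha").
inputs = processor(text="The weather is nice today.", src_lang="eng", return_tensors="pt")
output_tokens = model.generate(**inputs, tgt_lang="tha")
print(processor.decode(output_tokens[0], skip_special_tokens=True))
```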
One thing I tried with SeamlessM4T v2 was mixing multiple languages into a single audio recording and translating it, which, surprisingly, sometimes worked. Combining spoken English, French, and Thai into one sentence and translating it into Thai produced a clear, cohesive, and accurate translation. Translating the same mixed audio into English or French, however, always failed: the Thai portion went undetected or was misheard as English or French words, while the other languages were translated correctly. This fits a broader pattern in NLP, where lower-resource languages remain underrepresented and underserved.
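For anyone who wants to rerun this kind of experiment outside the demo, a speech-to-text translation call on a recorded clip might look like the sketch below. Again, this assumes the transformers integration, and "mixed_sentence.wav" is just a placeholder for your own recording:

```python
# Sketch: speech-to-text translation of a recorded clip with SeamlessM4T v2.
# Assumes `transformers`, `torch`, and `torchaudio` are installed;
# "mixed_sentence.wav" is a placeholder file name, not from the original post.
import torchaudio
from transformers import AutoProcessor, SeamlessM4Tv2ForSpeechToText

model_id = "facebook/seamless-m4t-v2-large"
processor = AutoProcessor.from_pretrained(model_id)
model = SeamlessM4Tv2ForSpeechToText.from_pretrained(model_id)

# Load the clip and resample to the 16 kHz input the model expects.
waveform, sample_rate = torchaudio.load("mixed_sentence.wav")
waveform = torchaudio.functional.resample(waveform, orig_freq=sample_rate, new_freq=16_000)

# Ask for Thai output; swapping tgt_lang to "eng" or "fra" matches the
# English/French targets discussed above.
inputs = processor(audios=waveform, sampling_rate=16_000, return_tensors="pt")
output_tokens = model.generate(**inputs, tgt_lang="tha")
print(processor.decode(output_tokens[0], skip_special_tokens=True))
```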
With more experimentation, I noticed that I had to enunciate very clearly in French and Thai, my non-native languages. This made it easy to see how mispronouncing a single tone in a Thai word could completely change the meaning of a sentence. Given that, Seamless could be an effective tool for language learning, particularly for improving pronunciation.
SeamlessExpressive
This model focuses on preserving the speaker’s voice and tone through the translation. I found that it worked surprisingly well, conveying the same volume, speed, and emotion as the original. It replicates speech patterns, including slurred words and emphasis, though not always flawlessly. Nonetheless, it holds promise for real-world applications, especially voice dubbing for movies and TV shows. Cartoons in particular might benefit, since their characters tend to have distinctive voices that would be hard for human voice actors to replicate in another language, and cartoon dialogue tends to carry exaggerated emotions and tones that would be easy for SeamlessExpressive to recognize and reproduce.
SeamlessStreaming
This model listens and translates in real time. In my testing, it wasn’t as impressive as the other models. Although it attempted to live-translate the videos I gave it, there was a consistent delay of at least 3-4 seconds between the original audio and the translated playback. At times it would pause for more than 10 seconds before resuming a translation that lagged well behind the video, and it seemed to hesitate every couple of words, so the output ended up sounding choppy.
Live Speech-to-Text translation fared a bit better. There was still a delay between the audio input and the translated text output, but it was noticeably faster and smoother than Speech-to-Speech translation.
The overall potential of these models is obvious. There are already plenty of machine translation tools in real-world use, but what sets these models apart is their accuracy and the number of languages they cover. Seamless can detect and translate many languages I’ve never seen recognized by other translation tools. As these models continue to evolve, they could meaningfully advance cross-cultural communication and language learning on a global scale.