With the rapid development of AI technology, language is no longer a barrier to communication. The latest breakthroughs in AI simultaneous interpretation allow people to communicate seamlessly across language environments without worrying about the inconvenience caused by language differences, effectively letting every ordinary person "carry" a dedicated simultaneous interpreter wherever they go. So, as the technology matures, which application scenarios will be affected first and become part of our daily lives?
Nowadays, many people are familiar with AI translation software. Most of these apps rely on text translation technology and have gradually added AI speech synthesis to simulate the experience of simultaneous interpretation; products such as iFlytek Simultaneous Interpretation, Youdao Translator, and Tencent Translator are representative of this field. They generally work in three steps: first quickly recognize the speaker's voice and convert it into text, then translate the text with natural language processing algorithms, and finally synthesize each translated sentence into speech and play it, one sentence at a time, to achieve the effect of "real-time translation".
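To make that pipeline concrete, here is a minimal Python sketch of the cascaded design described above. The three engine functions are hypothetical stand-ins for whatever real speech recognition, translation, and speech synthesis services such a product would call; none of them is a real API.

```python
# Cascaded pipeline: speech -> text (ASR) -> translated text (MT) -> speech (TTS).
# All three engines below are placeholder stand-ins, not real APIs.

def recognize_speech(audio_chunk: bytes) -> str:
    """Stand-in for a speech recognition engine: audio in, source text out."""
    return "你好，很高兴见到你。"  # placeholder result

def translate_text(text: str, target_lang: str = "en") -> str:
    """Stand-in for a machine translation engine."""
    return "Hello, nice to meet you."  # placeholder result

def synthesize_speech(text: str) -> bytes:
    """Stand-in for a text-to-speech engine: target text in, audio out."""
    return b"<synthesized audio>"  # placeholder result

def cascaded_interpret(audio_chunk: bytes) -> bytes:
    # Each stage must finish before the next begins, so the stage
    # latencies add up; this is the source of the delay discussed below.
    text = recognize_speech(audio_chunk)
    translation = translate_text(text)
    return synthesize_speech(translation)

print(cascaded_interpret(b"<microphone audio>"))
```

Because every sentence must pass through all three stages before playback can even begin, delays of several seconds accumulate naturally in this design.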
However, this chain of steps inevitably introduces translation delay. To reduce it, many translation apps drop voice playback altogether and display only the translated text, which can be translated in real time and continuously updated in the form of "subtitles". Even when the translation is frequently revised and redrawn as new input arrives, the reading experience is not disrupted, and skipping playback lets updated translations reach users faster and more continuously, improving the overall experience.
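The subtitle approach can be sketched as re-translating the growing transcript on every update and simply overwriting the displayed line; `translate_text` below is again a hypothetical stand-in for a real translation call.

```python
import sys
import time

def translate_text(text: str) -> str:
    """Stand-in for a real translation call."""
    return text.upper()  # placeholder "translation"

def stream_subtitles(partial_transcripts: list[str]) -> None:
    for partial in partial_transcripts:
        subtitle = translate_text(partial)
        # '\r' rewinds to the start of the line, so each revision
        # overwrites the previous one, like on-screen captions.
        sys.stdout.write("\r" + subtitle.ljust(60))
        sys.stdout.flush()
        time.sleep(0.5)
    print()

stream_subtitles(["thanks", "thanks for", "thanks for joining",
                  "thanks for joining us today"])
```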
AI simultaneous interpretation models work on a completely different principle.
Thanks to the explosive progress of AI models in recent years, simultaneous interpretation models that pursue extremely low latency and translate speech directly into speech began to appear in 2024. The goal of this type of model is to turn speech, directly or indirectly, into speech in the target language. Three models with outstanding performance have drawn particular attention and come closest to human simultaneous interpretation: Meta (formerly Facebook)'s Seamless-Streaming, StreamSpeech from the Institute of Computing Technology, Chinese Academy of Sciences, and Cicada Future's Simultaneous Interpretation v3 model. Unlike traditional AI translation software, these models try to imitate the behavior of a human interpreter: rather than translating the speaker's words immediately, they "listen" while continuously judging whether what they have heard so far can be translated completely and accurately, or whether they need to hear more before translating.
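The behavior these models share can be sketched as a read/write loop: keep taking in source words, and at each step a policy decides whether to emit translation or keep listening. The toy policy below (translate at a punctuation mark or after six words) is purely illustrative; each model's real, learned policy is described next.

```python
def ready_to_translate(buffer: list[str]) -> bool:
    """Toy policy: translate at punctuation or once six words pile up."""
    return len(buffer) >= 6 or buffer[-1][-1] in ".,;?!"

def translate(words: list[str]) -> str:
    """Stand-in for a real speech-to-speech translation step."""
    return "<translation of: " + " ".join(words) + ">"

def simultaneous_interpret(incoming_words: list[str]) -> list[str]:
    buffer: list[str] = []
    outputs: list[str] = []
    for word in incoming_words:            # READ: more speech arrives
        buffer.append(word)
        if ready_to_translate(buffer):     # decide: translate now or wait?
            outputs.append(translate(buffer))  # WRITE: emit translation
            buffer.clear()
    if buffer:                             # flush whatever remains at the end
        outputs.append(translate(buffer))
    return outputs

print(simultaneous_interpret("so today I want to talk about our new model .".split()))
```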
Meta has significant influence in the AI field, especially through its open-source contributions and cutting-edge research. Its AI research division, Meta AI, has made important breakthroughs across multiple areas of artificial intelligence: its open-source LLaMA large language models are widely used in the AI research community, and its open-source framework PyTorch has become a mainstream tool in AI research and applications worldwide. Its simultaneous interpretation model Seamless-Streaming is likewise open source, so anyone can examine its core design. According to the accompanying paper, Seamless-Streaming uses the "EMMA" strategy to decide whether the model should immediately translate what it has heard or keep waiting for more input. In test use, Seamless-Streaming achieved accurate translation with a delay of about 3 seconds; in other words, the translated output lags the original speech by roughly 3 seconds. Compared with the roughly 15-second delay of traditional AI translation software's "simultaneous interpretation", this is a breakthrough optimization that genuinely reaches the latency of a human simultaneous interpreter. Unfortunately, its accuracy still falls short of traditional AI translation; in tests on complex languages such as Chinese in particular, it misreads meaning and fails to pick up on "subtext".
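In spirit only (this is not Meta's implementation), the EMMA-style decision can be pictured as a learned confidence score that grows as evidence accumulates, with the model writing output only once the score crosses a threshold:

```python
def emission_probability(source_prefix: list[str]) -> float:
    """Stand-in for the learned attention-based score; here it simply
    grows with the amount of source heard."""
    return min(1.0, 0.15 * len(source_prefix))

def decide(source_prefix: list[str], threshold: float = 0.5) -> str:
    """WRITE (translate now) once confidence crosses the threshold, else READ."""
    return "WRITE" if emission_probability(source_prefix) >= threshold else "READ"

words = "the weather in Paris is lovely today".split()
for i in range(1, len(words) + 1):
    print(" ".join(words[:i]), "->", decide(words[:i]))
```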
As China's highest academic institution and comprehensive research center, the Chinese Academy of Sciences has also open-sourced its research in the form of the StreamSpeech model. According to its published paper, StreamSpeech mainly checks "alignments" to decide whether the model should keep waiting. Strikingly, its translation delay reaches an astonishing 0.3 seconds, roughly the average human reaction time. This speed far exceeds what a real person can achieve: for a human simultaneous interpreter, just hearing a word and comprehending it takes longer than 0.3 seconds. Unfortunately, the currently open-sourced version supports only English-French, English-Spanish, and English-German translation, with no Chinese support yet. Moreover, in tests of the low-latency model, fluency was somewhat unsatisfactory: the output read more like word-for-word machine translation than a fluent rendering of the sentence's meaning. This occurs largely because the latency is so low that the model is "forced" to translate sentences that are not yet finished.
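An alignment-style policy can be illustrated with a toy example (again, not StreamSpeech's actual code): if the next target word aligns only to source positions that have already been heard, it is safe to emit; otherwise keep listening.

```python
# Toy alignment: index of each target word -> index of the last
# source word it depends on (0-based).
alignment = {0: 0, 1: 2, 2: 2, 3: 4}
target_words = ["le", "temps", "est", "beau"]

def emit_ready(heard_source_count: int, next_target_idx: int) -> bool:
    """Emit only once the aligned source position has arrived."""
    return alignment[next_target_idx] < heard_source_count

emitted = 0
for heard in range(1, 6):  # source words arriving one at a time
    while emitted < len(target_words) and emit_ready(heard, emitted):
        print(f"after {heard} source word(s): emit '{target_words[emitted]}'")
        emitted += 1
```

Because the model commits to each target word as soon as its aligned source arrives, latency can drop to fractions of a second, but, as noted above, it may be forced to render unfinished sentences word by word.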
Unlike Meta and the Chinese Academy of Sciences, Cicada Future is a small Chinese R&D team based in London. The team is still raising funding and has not open-sourced its model; the only public information is a test sign-up window and a paper abstract. According to the abstract, the innovation of its Simultaneous Interpretation v3 model, compared with the two models above, is that the decision of whether to wait for more input is handed directly to the translation model itself. Two versions are in internal testing. The mini model takes low latency as its primary goal: in internal tests its translation delay averages 1-3 seconds, and when the speaker enunciates clearly its accuracy far exceeds the two models above. The large model is characterized by high accuracy: its average delay is 2-5 seconds, while its translation accuracy and fluency reach or even exceed human level; it even translates mixed Chinese-English expressions, classical poems, dialects, and popular memes accurately. The fly in the ointment is that internal-test slots are currently limited and demand is so high that there is often a queue; and its web translation feature remains some way from a commercial product, for now more of a "model showcase" page.
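Since the model itself is not public, the idea in the abstract can only be pictured hypothetically: folding the read/write decision into generation might look like the model being free to emit a special "wait" token instead of translation text, as in this entirely speculative sketch.

```python
WAIT = "<WAIT>"

def next_output(source_prefix: str) -> str:
    """Stand-in for the translation model: a toy rule that waits on
    short or unfinished prefixes, standing in for a learned decision."""
    words = source_prefix.split()
    if len(words) < 4 or not source_prefix.rstrip().endswith((".", "?", "!")):
        return WAIT  # the model itself chooses to keep listening
    return f"<translation of: {source_prefix}>"

for prefix in ["I think", "I think we should", "I think we should start now."]:
    print(repr(prefix), "->", next_output(prefix))
```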
To sum up, technological breakthroughs in AI simultaneous interpretation are turning more potential possibilities into reality. The scene in the movie "The Wandering Earth", where two people speaking different languages communicate normally after putting on earphones, is approaching at a visibly rapid pace. The software closest to that scenario, such as Cicada Future's simultaneous interpretation, has not yet launched as a product, so which scenario it will be used in first, and who its first users will be, has become the market's next question.
Before considering this question, let's review the core function simultaneous interpretation implements: taking the speaker's voice as real-time input, and producing translated speech that mimics the speaker's tone as real-time output. Given this function, here are some promising examples we would like to share, hoping they inspire you:
1. Flight attendants' announcements
On international flights, flight attendants are usually required to speak two or even more languages. Beyond serving passengers better, the more important reason is in-flight announcements of information such as entry instructions and connecting-flight details, which passengers from different countries all need to understand, so the same content must be delivered in several languages. Multilingualism is a real challenge for flight attendants, and unclear pronunciation causes trouble for passengers. AI simultaneous interpretation could help here: the attendant only needs to speak one language, and the AI delivers the content to passengers' ears in the same tone, so that passengers from all over the world can hear the announcement clearly and travel with greater peace of mind.
2. Online education
As the globalization of online education accelerates, more and more platforms and institutions hope to attract students from all over the world. However, language differences often stand between students and high-quality educational resources. Students whose mother tongue is a less widely spoken language, in particular, not only struggle to understand courses taught in another language but also fall behind in core subjects such as mathematics and physics, and many talented students go undiscovered as a result. AI simultaneous interpretation can break down exactly this barrier by providing real-time translation for lecturers: whatever language a lecturer teaches in, students receive the translated content simultaneously and are no longer limited by language in a globalized education environment.
AI simultaneous interpretation can provide real-time, accurate translation on such occasions, avoiding delays and misunderstandings in communication and enabling more efficient international exchange and cooperation. Beyond these, what other usage scenarios might the future hold? Perhaps the next breakthrough is hidden in some detail of our daily lives. As the technology continues to improve, AI simultaneous interpretation will gradually enter more everyday scenarios and become an indispensable part of global communication.