Waymo recently announced a notable development: a new training model for its robotaxi program built on Google's multimodal large language model (MLLM), Gemini. The model, called EMMA (End-to-End Multimodal Model for Autonomous Driving), processes sensor data to generate future trajectories for autonomous vehicles, helping a driverless car decide where to go and how to avoid obstacles.
EMMA is one of the first signs that leaders in autonomous driving intend to use MLLMs in their operations. It suggests these models can move beyond their current roles as chatbots, email assistants, and image generators and find application in an entirely new environment: on the road.
Waymo's research team says MLLMs like Gemini offer an interesting fit for autonomous driving systems for two reasons. First, they are "generalists": trained on vast amounts of data crawled from the internet, they carry rich "world knowledge" well beyond what is contained in ordinary driving logs. Second, they demonstrate strong reasoning ability through techniques such as "chain-of-thought reasoning," which mimics human reasoning by decomposing complex tasks into a series of logical steps.
EMMA performs well in trajectory prediction, object detection, and road-map understanding, but it also has limitations: it cannot yet integrate 3D sensor inputs from lidar or radar, and it can only process a small number of image frames at a time. Using an MLLM to train robotaxis also carries risks, such as the model hallucinating or failing at simple tasks. Waymo therefore says further research is needed to mitigate these problems and to advance the state of the art in autonomous driving model architecture.
Waymo's work points to a possible future direction for autonomous driving technology, bringing the industry both new promise and new challenges.