There are more and more screens on mobile phones, tablets, computers, and TVs, and the operations are becoming more and more complex. Is it dazzling you? Apple recently launched a king bomb - Ferret-UI2, a super powerful UI understanding model, claiming to unify the world. !
This is no bragging, the goal of Ferret-UI2 is to become a true hexagon warrior, able to understand the user interface on various platforms, whether it is iPhone, Android, iPad, web or AppleTV, it can easily win.
One of the highlights of Ferret-UI2 is its multi-platform support. Unlike Ferret-UI, which is limited to mobile platforms, Ferret-UI2 is able to understand UI screens from various devices such as tablets, web pages, and smart TVs. This multi-platform support enables it to adapt to today's diverse device ecosystem and provide users with a wider range of application scenarios.
In order to improve UI perception, Ferret-UI2 introduces dynamic high-resolution image encoding technology and adopts an enhancement method called "Adaptive Grid". With this approach, Ferret-UI2 is able to maintain perception at the native resolution of UI screenshots, allowing for more accurate recognition of visual elements and their relationships.
Additionally, Ferret-UI2 leverages high-quality training data to learn basic and advanced tasks. For basic tasks, Ferret-UI2 converts simple reference and positioning data into conversational form, allowing the model to build a basic understanding of various UI screens. For advanced tasks that focus more on user experience, Ferret-UI2 uses GPT-4o-based "marker set visual cues" technology to generate training data and replaces the simple clicks of the previous method with single-step user-centered interactions. instruction.
To evaluate the performance of Ferret-UI2, the researchers built 45 benchmarks covering five platforms, including 6 basic tasks and 3 advanced tasks for each platform. Additionally, they used public benchmarks such as GUIDE and GUI-World. The results show that Ferret-UI2 outperforms Ferret-UI in all tested benchmarks, especially achieving significant improvements on advanced tasks, demonstrating its versatility in handling cross-platform UI understanding tasks.
Ablation studies further show that both architectural improvements and dataset improvements in Ferret-UI2 contribute to performance improvements, with the new dataset having a more significant impact on more challenging tasks. In addition, Ferret-UI2 also performs well in cross-platform transfer learning, especially showing good generalization capabilities between iPhone, iPad and Android platforms.
Model address: https://huggingface.co/jadechoghari/Ferret-UI-Llama8b
Paper address: https://arxiv.org/pdf/2410.18967