In recent years, AI image generation has advanced rapidly, yet some seemingly simple scenes remain difficult to render correctly. A research team from Shanghai Jiao Tong University found that AI models repeatedly failed to generate "iced Coke in a teacup", drawing academic attention to the problem of text-image misalignment. The team studied this "teacup problem" in depth and proposed a new method, Mixture of Concept Experts (MoCE), which mitigates the latent concept misalignment behind such failures.
The capabilities of AI image generators continue to improve. However, even the most advanced models can struggle with certain seemingly simple tasks. Recently, Zhao Juntu, a doctoral candidate at Shanghai Jiao Tong University, and his team discovered that AI models had unexpected difficulty generating the scene "iced Coke in a teacup".
This phenomenon, known as text-image misalignment, has attracted academic attention. In October 2023, when AI image generation models were just emerging, Zhao Juntu and his team found that models asked to draw this scene often produced a transparent glass filled with iced Coke instead of a teacup. Even with state-of-the-art models in July 2024, the results were still unsatisfactory.
To explore this issue in depth, the research group of Professor Wang Dequan at Shanghai Jiao Tong University, in the forthcoming paper "Lost in Translation: Latent Concept Misalignment in Text-to-Image Diffusion Models", classifies it as a misalignment problem involving latent (hidden) concepts: Latent Concept Misalignment, or LC-Mis for short. They designed a system based on large language models (LLMs) that draws on the human thinking embedded in LLMs to quickly collect concept pairs with similar problems.
The research team proposed a method called Mixture of Concept Experts (MoCE), which incorporates the step-by-step order of painting into the multi-step sampling process of diffusion models and successfully recovers the missing teacup.
MoCE divides the entire sampling process into two stages: the first stage provides only the easily overlooked concept, and the second stage uses the complete text prompt. With this approach, MoCE can control the alignment between text and image more precisely during generation.
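The two-stage idea can be illustrated with a minimal Python sketch. It is not the team's implementation: the function name `two_stage_prompt_schedule`, its parameters, and the fixed `switch_step` are assumptions made for illustration, and the article does not say how the boundary between the two stages is chosen. The sketch only shows which text prompt would condition each sampling step.

```python
from typing import List


def two_stage_prompt_schedule(
    overlooked_concept: str,
    full_prompt: str,
    total_steps: int,
    switch_step: int,
) -> List[str]:
    """Return the text prompt used to condition each sampling step.

    Hypothetical helper: `switch_step` (the number of steps spent in
    stage one) is an assumed, hand-picked parameter.
    """
    if not 0 <= switch_step <= total_steps:
        raise ValueError("switch_step must be between 0 and total_steps")
    # Stage 1: condition only on the easily overlooked concept (the teacup).
    stage_one = [overlooked_concept] * switch_step
    # Stage 2: condition on the complete text prompt for the remaining steps.
    stage_two = [full_prompt] * (total_steps - switch_step)
    return stage_one + stage_two


schedule = two_stage_prompt_schedule(
    overlooked_concept="a tea cup",
    full_prompt="iced Coke in a tea cup",
    total_steps=10,
    switch_step=3,
)
for step, prompt in enumerate(schedule):
    print(step, prompt)
```

In an actual diffusion pipeline, each entry of the schedule would be encoded into the text conditioning for the corresponding denoising step, so the early steps lay down the teacup before the full prompt takes over.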
The MoCE method significantly reduces the proportion of level 5 LC-Mis concept pairs, and to some extent even surpasses DALL·E 3 (October 2023 version), which relies on costly data annotation.
In addition, the research team found that existing automated evaluation metrics have obvious flaws when facing this new type of problem. For example, some metrics give a lower score to iced Coke in a teacup but a higher score to iced Coke in a clear glass. This suggests that even the tools used to evaluate AI performance can have biases and limitations.
The researchers plan to explore more complex LC-Mis scenarios in future work and develop learnable search algorithms to reduce the number of iterations. They also plan to expand the types of models, model versions, and sampler types used in the dataset, and continue to iterate on the dataset collection algorithm to enhance and expand the dataset.
This research not only provides a new perspective for understanding the limitations of AI in image generation, but also provides new ideas and methods for improving AI's image generation capabilities. As technology continues to advance, we expect AI to make greater breakthroughs in understanding and reproducing human creativity.
Project address: https://lcmis.github.io/
Paper: https://arxiv.org/pdf/2408.00230
This study of the "teacup problem" in AI image generation reveals the limitations of AI models in handling subtle concepts and offers a useful reference for the future direction of AI technology. The MoCE method proposed by the research team, together with its critique of existing evaluation metrics, may help push AI image generation technology to the next level.