The editor of Downcodes has learned that Google DeepMind and the Massachusetts Institute of Technology (MIT) have achieved a major breakthrough in text-to-image generation. Their new autoregressive model, Fluid, delivers excellent performance at a scale of 10.5 billion parameters, overturning the industry's assumptions about autoregressive models for image generation. The core of the research lies in two innovations, continuous tokens and a random generation order, which significantly improve the model's performance and scalability and point image generation technology in a new direction.
Google DeepMind and the Massachusetts Institute of Technology (MIT) recently released a major research result. Fluid, the new autoregressive model developed by the research team, has made breakthrough progress in text-to-image generation, performing excellently after being scaled to 10.5 billion parameters.
This research overturns a common perception in the industry. Although autoregressive models dominate language processing, they had been considered inferior to diffusion models such as Stable Diffusion and Google Imagen 3 for image generation. The researchers significantly improved the performance and scalability of the autoregressive approach by introducing two key design changes: using continuous tokens instead of discrete tokens, and generating tokens in a random order instead of a fixed one.
Continuous tokens have a clear advantage for representing image information. Traditional discrete tokenization maps each image region to a code drawn from a limited vocabulary, which inevitably discards information; even large models struggle to reproduce fine details such as symmetrical eyes. Continuous tokens preserve more precise information and significantly improve image reconstruction quality.
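The information loss described above can be illustrated with a toy example. The sketch below is not Fluid's actual tokenizer; it just quantizes random "patch latents" against a small, fixed codebook (a stand-in for a discrete vocabulary) and compares the reconstruction error with keeping the continuous vectors unchanged:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "image patch" latents: 64 patches, each an 8-dim feature vector.
latents = rng.normal(size=(64, 8))

# Discrete tokenization: snap each latent to its nearest entry in a
# small fixed codebook (16 entries here, purely for illustration).
codebook = rng.normal(size=(16, 8))
nearest = np.argmin(
    ((latents[:, None, :] - codebook[None, :, :]) ** 2).sum(-1), axis=1
)
discrete_recon = codebook[nearest]

# Continuous tokens keep the full vector, so reconstruction is exact.
continuous_recon = latents.copy()

discrete_err = np.mean((latents - discrete_recon) ** 2)    # nonzero: detail lost
continuous_err = np.mean((latents - continuous_recon) ** 2)  # exactly zero
print(f"discrete MSE:   {discrete_err:.4f}")
print(f"continuous MSE: {continuous_err:.4f}")
```

Enlarging the codebook shrinks the quantization error but never eliminates it, which is the trade-off that motivates continuous tokens in the first place.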
The research team also rethought the generation order. Traditional autoregressive models generate images in a fixed raster order, left to right and top to bottom. The researchers instead tried a random order, allowing the model to predict multiple tokens at arbitrary positions at each step. This approach excels at tasks that require a grasp of the overall image structure, and it achieved a significant advantage on the GenEval benchmark, which measures how well generated images match their text prompts.
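The scheduling idea can be sketched in a few lines. This is a hypothetical illustration of the generation order only, with no model attached: positions are visited in a random permutation and split into chunks, so each step fills several tokens at arbitrary locations rather than appending one token at the end as raster order would:

```python
import numpy as np

rng = np.random.default_rng(0)

def random_order_schedule(num_tokens, num_steps):
    """Split a random permutation of token positions into per-step chunks."""
    order = rng.permutation(num_tokens)
    return np.array_split(order, num_steps)

schedule = random_order_schedule(num_tokens=16, num_steps=4)
canvas = np.full(16, -1)             # -1 marks a not-yet-generated token
for step, positions in enumerate(schedule):
    canvas[positions] = step         # pretend the model predicts these now
    print(f"step {step}: filled positions {sorted(positions.tolist())}")

# After all steps, every position has been generated exactly once.
```

In a real model, each step would condition on all previously generated tokens, so later predictions can use global context from every part of the image, not just the region above and to the left.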
Fluid's actual performance confirms the value of the research. After scaling to 10.5 billion parameters, Fluid outperformed existing models on multiple important benchmarks. Notably, even the small Fluid model with only 369 million parameters matched the MS-COCO FID score (7.23) of the 20-billion-parameter Parti model.
These results suggest that autoregressive models like Fluid could become powerful alternatives to diffusion models. Whereas diffusion models require many iterative denoising steps to produce an image, Fluid generates one in a single pass, and this efficiency advantage should become more pronounced as the model is scaled further.
This research opens new possibilities for text-to-image generation, and the emergence of Fluid marks the rise of autoregressive models in the field. Going forward, we can look forward to more applications and improvements built on Fluid that further advance AI image generation. The editor of Downcodes will continue to follow the latest developments in this field and bring readers more exciting content.