Generating high-resolution, photorealistic images from text has long been a difficult problem in computer vision. Established approaches such as diffusion models and visual autoregressive models can produce high-quality images, but they consume large amounts of computing resources and tend to lose fine detail. "Infinity", a new framework proposed by ByteDance, aims to address these problems. It significantly improves generation efficiency and image quality through bit-level tokenization and an infinite-vocabulary classifier.
High-resolution, photorealistic image generation has always faced multiple challenges, especially in text-to-image synthesis. Existing generative methods mostly rely on diffusion models and visual autoregressive (VAR) frameworks.
Although these models can produce high-quality images, they consume large amounts of computing resources, which makes them poorly suited to real-time applications. VAR models are also prone to cumulative errors when processing discrete tokens, which causes loss of detail in the generated image and reduces its realism.
To overcome these shortcomings, ByteDance’s research team launched a new framework called “Infinity”, which is designed to improve the efficiency and quality of text-to-image synthesis.
Infinity achieves a finer-grained representation by using bit-level tokens instead of conventional index-level tokens, which significantly reduces quantization error and improves the realism of the generated images. In addition, the framework uses an Infinite Vocabulary Classifier (IVC), which predicts the bits of a token rather than a single index over the whole vocabulary. This extends the token vocabulary to 2^64 while significantly reducing memory and computing requirements.
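To give a rough sense of why predicting bits is cheaper than predicting indices, the sketch below counts output-layer weights under two assumptions that are not stated in this article: the classifier is a single linear layer on a hidden state of width `hidden_dim`, and the bit-level variant uses one two-way classifier per bit. The function names and the width of 2048 are illustrative only.

```python
def index_classifier_weights(hidden_dim: int, bits: int) -> int:
    """Weights for one linear layer that scores every index in a 2**bits vocabulary."""
    return hidden_dim * (2 ** bits)


def bitwise_classifier_weights(hidden_dim: int, bits: int) -> int:
    """Weights for `bits` independent two-way classifiers (an assumed design)."""
    return hidden_dim * 2 * bits


if __name__ == "__main__":
    hidden_dim = 2048  # illustrative width, not taken from the article
    bits = 64          # corresponds to the 2^64 vocabulary mentioned above
    print("index-level weights:", index_classifier_weights(hidden_dim, bits))
    print("bit-level weights:  ", bitwise_classifier_weights(hidden_dim, bits))
```

Under these assumptions the index-level layer grows exponentially with the number of bits, while the bit-level layer grows only linearly, which is the intuition behind the reported memory savings.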
The Infinity architecture consists of three main parts: a bit-level multi-scale quantization tokenizer that converts image features into binary tokens to reduce computational overhead; a transformer-based autoregressive model that predicts residuals conditioned on the text prompt and previous outputs; and a self-correction mechanism that introduces random bit flips during training to improve the model's robustness to errors. The research team trained the model on large datasets such as LAION and OpenImages, gradually increasing the image resolution from 256×256 to 1024×1024.
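Bit-level tokenization and the random bit flips used for self-correction can be illustrated with a toy example. This is a minimal sketch, assuming a sign-based quantizer (one bit per feature channel) and independent per-bit flips; the function names, the flip probability, and the feature values are hypothetical and are not taken from the Infinity code.

```python
import random


def quantize_to_bits(features):
    """Map each feature to one bit by its sign: 1 if the value is >= 0, else 0."""
    return [1 if x >= 0 else 0 for x in features]


def bits_to_index(bits):
    """Pack a bit list into the single integer an index-level tokenizer would use."""
    index = 0
    for b in bits:
        index = (index << 1) | b
    return index


def flip_random_bits(bits, flip_prob, rng):
    """Flip each bit independently with probability flip_prob (simulated prediction errors)."""
    return [b ^ 1 if rng.random() < flip_prob else b for b in bits]


if __name__ == "__main__":
    rng = random.Random(0)                 # fixed seed so the demo is repeatable
    features = [0.3, -1.2, 0.0, -0.1]      # made-up feature values
    clean = quantize_to_bits(features)
    noisy = flip_random_bits(clean, flip_prob=0.25, rng=rng)
    print("clean bits:", clean, "index:", bits_to_index(clean))
    print("noisy bits:", noisy, "index:", bits_to_index(noisy))
```

Training on the noisy bits while still targeting the clean ones is one plausible way a model could learn to recover from its own earlier mistakes; the article does not describe the exact training procedure.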
In evaluation, Infinity performed strongly on key metrics, with a GenEval score of 0. and a Fréchet Inception Distance (FID) reduced to 3.48, indicating improved generation quality. Infinity can generate a 1024×1024 high-resolution image in 0.8 seconds, which demonstrates its efficiency. The generated images are visually realistic and rich in detail, and they follow complex text instructions accurately, resulting in high human preference scores.
The launch of Infinity sets a new benchmark in high-resolution text-to-image synthesis and advances generative AI by addressing long-standing problems of scalability and detail quality.
Paper: https://arxiv.org/abs/2412.04431
Highlights:
**Innovative Framework Infinity:** The Infinity framework launched by ByteDance greatly improves the efficiency of high-resolution image generation through bit-level tokenization and an Infinite Vocabulary Classifier.
⚡ **Excellent performance:** Infinity surpasses existing models in key evaluation metrics and can generate a 1024×1024 high-quality image in 0.8 seconds.
**Authentic details and responsiveness:** The generated images are visually realistic and follow complex text prompts accurately, earning high human preference scores.
All in all, the Infinity framework provides an efficient, high-quality solution for high-resolution text-to-image generation. It achieves significant advances in speed, image quality, and responsiveness to complex text instructions, and sets a new milestone for the development of generative AI.