The Downcodes editor introduces Meissonic, a text-to-image generation model with only 1 billion parameters that can generate 1024×1024 high-definition images. It breaks through the limitations of models such as Stable Diffusion and elevates non-autoregressive masked image modeling (MIM) to a new level, with performance and efficiency comparable to top diffusion models such as SDXL. Meissonic's innovation lies in its unique architectural design, advanced positional encoding strategy, and optimized sampling conditions, which enable it to run on consumer-grade GPUs without additional optimization. Even more surprising, it can easily generate images with solid-color backgrounds, which usually require complex adjustments in diffusion models.
The core of Meissonic is a series of architectural innovations, advanced positional encoding strategies, and optimized sampling conditions, which together significantly improve the performance and efficiency of MIM. Additionally, Meissonic leverages high-quality training data, integrates micro-conditioning based on human preference scores, and employs feature compression layers to further enhance image fidelity and resolution.
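To see what non-autoregressive MIM sampling means in practice: unlike a diffusion model's denoising steps or an autoregressive model's token-by-token generation, the model starts from a fully masked token grid and, in a handful of parallel steps, commits the most confident predictions while re-masking the rest. The following is a minimal sketch of that loop with a dummy random predictor standing in for the transformer; the function names, the cosine schedule, and all hyperparameters are illustrative assumptions, not Meissonic's actual implementation.

```python
import numpy as np

MASK = -1  # sentinel for a still-masked token position

def mim_sample(predict, seq_len=16, vocab=1024, steps=4, seed=0):
    """MaskGIT-style iterative decoding: begin fully masked, then at each
    step commit the highest-confidence predictions among masked positions."""
    rng = np.random.default_rng(seed)
    tokens = np.full(seq_len, MASK)
    for step in range(steps):
        pred_tok, conf = predict(tokens, rng)
        # cosine schedule: fraction of positions left masked after this step
        frac = np.cos((step + 1) / steps * np.pi / 2)
        n_keep_masked = int(np.floor(frac * seq_len))
        masked = np.where(tokens == MASK)[0]
        order = masked[np.argsort(-conf[masked])]  # most confident first
        n_commit = max(len(masked) - n_keep_masked, 1)  # always make progress
        commit = order[:n_commit]
        tokens[commit] = pred_tok[commit]
    return tokens

def dummy_predict(tokens, rng):
    # stand-in for the transformer: random token ids and confidences
    return rng.integers(0, 1024, size=tokens.shape), rng.random(tokens.shape)

out = mim_sample(dummy_predict)
```

After the final step the schedule leaves zero positions masked, so a few parallel steps replace the hundreds of sequential steps a diffusion sampler would need.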
Unlike large diffusion models such as SDXL and DeepFloyd-XL, Meissonic has only 1 billion parameters, yet it can generate high-quality 1024×1024 images and run on consumer-grade GPUs with only 8GB of video memory without any additional model optimization. Additionally, Meissonic can easily generate images with solid-color backgrounds, which in diffusion models often requires model fine-tuning or noise-offset adjustments.
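The 8GB figure is plausible from back-of-the-envelope arithmetic: 1 billion parameters in half precision occupy about 2 GiB, leaving headroom for activations and the text encoder. The byte sizes below are standard; the claim about what fills the remaining headroom is an assumption, not a measurement.

```python
# Rough VRAM estimate for a 1B-parameter model (weights only).
params = 1_000_000_000
bytes_per_param_fp16 = 2  # half precision: 2 bytes per parameter
weights_gib = params * bytes_per_param_fp16 / 2**30
print(f"fp16 weights: {weights_gib:.2f} GiB")  # well under 8 GB
# Activations, attention buffers, and the text encoder add overhead,
# which is presumably why the practical requirement is 8 GB, not 2 GB.
```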
To achieve efficient training, Meissonic's training process is broken down into four carefully designed stages:
Stage 1: Learn basic concepts from massive data. Meissonic trains at 256×256 resolution on the filtered LAION-2B dataset to learn fundamental visual concepts.
Stage 2: Align text and images using long prompts. The training resolution is increased to 512×512, and high-quality synthetic image-text pairs and internal datasets are used to improve the model's ability to understand long, descriptive prompts.
Stage 3: Master feature compression for higher-resolution generation. By introducing a feature compression layer, Meissonic can transition seamlessly from 512×512 to 1024×1024 generation, training on a curated set of high-quality, high-resolution image-text pairs.
Stage 4: Optimize high-resolution aesthetic image generation. At this stage, the model is fine-tuned with a smaller learning rate, and human preference scores are added as micro-conditions to enhance the quality of generated images.
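The four-stage curriculum above can be summarized as a simple schedule table. The resolutions and data sources follow the description in the article; the learning-rate values are invented placeholders purely for illustration.

```python
# Sketch of Meissonic's four-stage training curriculum.
# Resolutions/data follow the article; learning rates are hypothetical.
STAGES = [
    {"stage": 1, "goal": "basic concepts",        "resolution": 256,
     "data": "filtered LAION-2B",          "lr": 1e-4},
    {"stage": 2, "goal": "long-prompt alignment", "resolution": 512,
     "data": "synthetic + internal pairs", "lr": 1e-4},
    {"stage": 3, "goal": "feature compression",   "resolution": 1024,
     "data": "curated high-res pairs",     "lr": 1e-4},
    {"stage": 4, "goal": "aesthetic fine-tuning", "resolution": 1024,
     "data": "curated high-res pairs",     "lr": 1e-5,
     "micro_condition": "human preference score"},
]

def schedule_summary(stages):
    """One line per stage: 'stage N: goal @ WxH'."""
    return [
        f"stage {s['stage']}: {s['goal']} @ {s['resolution']}x{s['resolution']}"
        for s in stages
    ]
```

Note the pattern: resolution ramps up across stages while only the final stage drops the learning rate for fine-tuning, a common curriculum design in large-scale image-model training.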
Meissonic demonstrates superior performance and efficiency across a range of quantitative and qualitative metrics, including HPS, MPS, and GenEval benchmarks as well as GPT-4o evaluation. Compared with DALL-E 2 and SDXL, Meissonic achieves competitive human preference and text alignment scores while remaining far more efficient.
Additionally, Meissonic excels at zero-shot image-to-image editing. On the EMU-Edit dataset, it achieves leading results across seven operations: background change, image content change, style change, object removal, object addition, local modification, and color/texture change, none of which requires training or fine-tuning on image-editing-specific data or instruction sets.
Project address: https://github.com/viiika/Meissonic
Paper address: https://arxiv.org/pdf/2410.08261
With its efficiency and high performance, Meissonic brings new possibilities to the field of image generation. Its lightweight design makes it accessible to everyday users and offers new ideas for future research directions. Interested readers can visit the project and paper links above for more information.