Significant progress has been made in image generation, but the limitations of existing models have hindered the unification of language and vision models. This article introduces Meissonic, a new text-to-image model that uses non-autoregressive masked image modeling (MIM) to match the image quality of state-of-the-art diffusion models such as SDXL with only 1 billion parameters, and to generate 1024×1024 images on consumer GPUs.
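To give a feel for how non-autoregressive MIM generation works, here is a minimal, self-contained sketch of MaskGIT-style iterative parallel decoding, the family of samplers that MIM models build on. The toy network, sizes, and schedule are illustrative assumptions, not code from the Meissonic repository:

```python
import numpy as np

VOCAB = 8192                 # size of the VQ codebook (illustrative)
MASK_ID = VOCAB              # extra id marking "not yet decided"
NUM_TOKENS = 32 * 32         # tokens in one image grid
STEPS = 16                   # decoding iterations (vs. one per token for AR)

def model_logits(tokens, rng):
    # Stand-in network: the real model conditions on text and tokens.
    return rng.standard_normal((tokens.shape[0], VOCAB))

def mim_sample(rng=np.random.default_rng(0)):
    tokens = np.full(NUM_TOKENS, MASK_ID)              # start fully masked
    for step in range(1, STEPS + 1):
        masked = tokens == MASK_ID
        logits = model_logits(tokens, rng)
        probs = np.exp(logits - logits.max(-1, keepdims=True))
        probs /= probs.sum(-1, keepdims=True)
        pred = probs.argmax(-1)                        # predicted codes
        conf = probs.max(-1)                           # confidence per token
        conf[~masked] = -np.inf                        # decided: never re-rank
        # Cosine schedule: how many tokens stay masked after this step.
        keep_masked = int(NUM_TOKENS * np.cos(np.pi / 2 * step / STEPS))
        n_unmask = int(masked.sum()) - keep_masked
        best = np.argsort(-conf)[:n_unmask]            # most confident slots
        tokens[best] = pred[best]                      # unmask them in parallel
    return tokens  # discrete codes; a VQ decoder maps them to pixels

print(mim_sample()[:8])
```

Because many tokens are committed in parallel at each step, an entire image emerges in a handful of forward passes rather than one pass per token, which is where MIM's efficiency comes from.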
At the heart of Meissonic is a set of architectural innovations, advanced positional encoding strategies, and optimized sampling conditions that significantly improve MIM's performance and efficiency. In addition, Meissonic leverages high-quality training data, integrates micro-conditions based on human preference scores, and adopts feature compression layers to further enhance image fidelity and resolution.
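As a concrete example of one of these ideas, the sketch below shows how micro-conditions such as a human preference score can be injected: each scalar is embedded like a diffusion timestep and passed to the model alongside the text embedding (the approach SDXL popularized for size and crop conditions). Function and variable names here are illustrative assumptions, not Meissonic's actual API:

```python
import math
import torch

def sinusoidal_embed(x: torch.Tensor, dim: int = 256) -> torch.Tensor:
    # Classic transformer-style embedding of a scalar into `dim` features.
    half = dim // 2
    freqs = torch.exp(-math.log(10_000.0) * torch.arange(half) / half)
    angles = x[:, None].float() * freqs[None, :]
    return torch.cat([angles.sin(), angles.cos()], dim=-1)

def micro_condition(width: int, height: int, preference: float) -> torch.Tensor:
    # Embed each scalar separately, then concatenate; downstream the result
    # is combined with the pooled text embedding as extra conditioning.
    scalars = torch.tensor([[width, height, preference]], dtype=torch.float32)
    parts = [sinusoidal_embed(scalars[:, i]) for i in range(scalars.shape[1])]
    return torch.cat(parts, dim=-1)

cond = micro_condition(1024, 1024, preference=0.9)
print(cond.shape)  # torch.Size([1, 768])
```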
Unlike large diffusion models such as SDXL and DeepFloyd-XL, Meissonic has only 1 billion parameters, yet it generates high-quality 1024×1024 images and runs on consumer-grade GPUs with just 8 GB of VRAM, without any additional model optimization. Meissonic also makes it easy to generate images with solid-color backgrounds, something that typically requires fine-tuning or noise-offset adjustments in diffusion models.
To train efficiently, Meissonic's training process is broken into four carefully designed stages (a schematic sketch follows the list):
Stage 1: Learn basic concepts from massive data. Meissonic trains at 256×256 resolution on a filtered LAION-2B dataset to learn fundamental visual concepts.
Stage 2: Align text and images with long prompts. The training resolution is raised to 512×512, and high-quality synthetic image-text pairs and internal datasets are used to improve the model's understanding of long, descriptive prompts.
Stage 3: Master feature compression for higher-resolution generation. By introducing feature compression layers, Meissonic transitions seamlessly from 512×512 to 1024×1024 generation, training on a curated set of high-quality, high-resolution image-text pairs.
Stage 4: Refine high-resolution aesthetic generation. The model is fine-tuned with a smaller learning rate, and human preference scores are added as micro-conditions to improve the quality of generated images.
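A schematic view of this curriculum as a plain config: the resolutions and data sources come from the stages above, while the field names, learning-rate numbers, and loop are illustrative assumptions:

```python
STAGES = [
    {"name": "base_concepts", "resolution": 256,
     "data": "filtered LAION-2B"},
    {"name": "long_prompt_alignment", "resolution": 512,
     "data": "synthetic pairs + internal datasets"},
    {"name": "feature_compression", "resolution": 1024,
     "data": "curated high-res image-text pairs"},
    {"name": "aesthetic_finetune", "resolution": 1024,
     "data": "preference-scored pairs",
     "lr_scale": 0.1,                       # "smaller learning rate"
     "micro_conditions": ["human_preference_score"]},
]

for stage in STAGES:
    lr = 1e-4 * stage.get("lr_scale", 1.0)  # base lr is a placeholder
    print(f"{stage['name']}: train @ {stage['resolution']}px, "
          f"lr={lr:.0e}, data={stage['data']}")
```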
Meissonic demonstrates its performance and efficiency across a range of quantitative and qualitative benchmarks, including HPS, MPS, GenEval, and GPT-4o evaluation. Compared with DALL-E 2 and SDXL, Meissonic achieves competitive human preference and text-alignment scores while remaining far more efficient.
Additionally, Meissonic performs well in zero-shot image-to-image editing. On the EMU-Edit dataset, it achieves leading results across seven operations (background change, image content change, style change, object removal, object addition, local modification, and color/texture change) without any training or fine-tuning on editing-specific data or instruction sets.
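The masked-token formulation is what makes such zero-shot editing natural: the source image is encoded to discrete codes, only the tokens in the region to edit are re-masked, and the same sampling loop used for generation fills them in under the new prompt. The sketch below uses random stubs in place of the real VQ encoder and sampler; it illustrates the general MIM editing recipe, not Meissonic's actual code:

```python
import numpy as np

VOCAB, MASK_ID, GRID = 8192, 8192, 32
rng = np.random.default_rng(0)

def encode(image):
    # Stub: a real VQ encoder maps pixels to a 32x32 grid of code indices.
    return rng.integers(0, VOCAB, size=(GRID, GRID))

def resample(tokens, prompt):
    # Stub: the real sampler fills MASK_ID slots conditioned on `prompt`.
    filled = tokens.copy()
    holes = filled == MASK_ID
    filled[holes] = rng.integers(0, VOCAB, size=int(holes.sum()))
    return filled

def edit(image, region, prompt):
    tokens = encode(image)           # source image as discrete codes
    tokens[region] = MASK_ID         # mask only the tokens being edited
    return resample(tokens, prompt)  # same sampler as plain generation

region = np.zeros((GRID, GRID), dtype=bool)
region[8:24, 8:24] = True            # e.g. replace the image center
print(edit(None, region, "a red balloon").shape)  # (32, 32)
```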
Project address: https://github.com/viiika/Meissonic
Paper address: https://arxiv.org/pdf/2410.08261
In summary, Meissonic marks a significant step forward in efficiency and image-generation quality, pointing to a new direction for future language-vision models. Its lightweight design lets it run on consumer hardware, and its strong zero-shot image-editing capability gives it broad application prospects.