Efficiently generating high-quality long videos has long been a major challenge in computer vision. To address the slow inference of existing diffusion Transformer models (DiTs), Meta AI researchers proposed AdaCache, a training-free acceleration method. AdaCache exploits differences in video content, tailoring a caching schedule to each video, and introduces a motion regularization scheme that allocates compute according to how much motion the video contains, significantly improving inference speed while preserving generation quality.
Generating high-quality, temporally consistent video demands substantial compute, especially over longer time spans. Although recent diffusion Transformer models (DiTs) have made significant progress in video generation, their reliance on larger models and heavier attention mechanisms makes inference slow, exacerbating the challenge. To address this, researchers at Meta AI proposed AdaCache, a training-free method for accelerating video DiTs.
The core idea of AdaCache rests on the observation that "not all videos are the same": some videos need fewer denoising steps than others to reach reasonable quality. Building on this, the method not only caches intermediate computations during the diffusion process but also devises a caching schedule tailored to each video being generated, maximizing the quality-latency trade-off.
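To make the idea concrete, here is a minimal PyTorch-style sketch of content-adaptive residual caching. The block interface, the distance metric, and the threshold/rate schedule (`adacache_step`, `thresholds`, `rates`) are illustrative assumptions for exposition, not the paper's exact implementation.

```python
import torch

def adacache_step(block, x, t, cache, thresholds=(0.08, 0.16), rates=(4, 2, 1)):
    """One denoising step with content-adaptive residual caching (sketch).

    `block` is a callable DiT transformer block; `cache` is a dict holding
    the last computed residual and a countdown of steps to skip.
    All names and values here are hypothetical.
    """
    if cache.get("skip_steps", 0) > 0:
        cache["skip_steps"] -= 1
        return x + cache["residual"]           # reuse the cached residual

    residual = block(x, t) - x                 # full (expensive) computation
    prev = cache.get("residual")
    if prev is not None:
        # Distance between consecutive residuals measures how quickly this
        # particular video's features are changing across denoising steps.
        dist = (residual - prev).abs().mean() / prev.abs().mean().clamp_min(1e-6)
        # Slowly changing content earns a longer skip; fast-changing
        # content forces frequent recomputation.
        if dist < thresholds[0]:
            cache["skip_steps"] = rates[0] - 1
        elif dist < thresholds[1]:
            cache["skip_steps"] = rates[1] - 1
        else:
            cache["skip_steps"] = rates[2] - 1
    cache["residual"] = residual
    return x + residual
```

Because the skip length is derived from the measured rate of change rather than a fixed schedule, each video effectively gets its own compute budget, which is the trade-off the paper describes.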
The researchers further introduce a motion regularization (MoReg) scheme, which uses the motion content of the video being generated to control how AdaCache allocates compute. Since sequences with high-frequency textures and large amounts of motion require more diffusion steps to reach reasonable quality, MoReg lets AdaCache direct compute where it is most needed.
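The sketch below illustrates one plausible way to fold a motion signal into the caching decision above: estimate motion from frame-to-frame changes in the video latents and scale the caching distance by it. The `(frames, channels, height, width)` latent layout, the `motion_regularizer` helper, and the `strength` parameter are all assumptions for illustration.

```python
import torch

def motion_regularizer(latents):
    """Estimate motion as the mean frame-to-frame change in video latents.

    `latents` is assumed to have shape (frames, channels, height, width).
    """
    frame_diff = (latents[1:] - latents[:-1]).abs().mean()
    return frame_diff / latents.abs().mean().clamp_min(1e-6)

def regularized_distance(dist, latents, strength=1.0):
    # Inflate the caching metric for high-motion videos: they appear to be
    # "changing faster", so the scheduler recomputes more often for them.
    return dist * (1.0 + strength * motion_regularizer(latents))
```

In this sketch, `regularized_distance` would replace the raw `dist` used in the caching decision, biasing the schedule toward more frequent recomputation for high-motion clips and longer cache reuse for static ones.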
Experimental results show that AdaCache significantly improves inference speed (for example, up to 4.7x faster on Open-Sora 720p, 2-second video generation) without sacrificing generation quality. AdaCache also generalizes well, applying to different video DiT models such as Open-Sora, Open-Sora-Plan, and Latte, and it shows clear advantages in both speed and quality over other training-free acceleration methods such as ∆-DiT, T-GATE, and PAB.
A user study shows that participants prefer AdaCache-generated videos over those from competing methods and rate their quality as comparable to the baseline models. These results confirm the effectiveness of AdaCache and mark a meaningful contribution to efficient video generation. Meta AI believes AdaCache can be widely adopted and help make high-fidelity long-video generation broadly accessible.
Paper: https://arxiv.org/abs/2411.02397
Project homepage: https://adacache-dit.github.io/
GitHub: https://github.com/AdaCache-DiT/AdaCache
In short, AdaCache is an efficient acceleration method that opens new possibilities for high-fidelity long-video generation; its substantial speedups and strong user-rated quality give it broad prospects for future applications. This research by Meta AI represents an important step forward in efficient video generation.