Consistency Models Made Easy

Zhengyang Geng, William Luo, Ashwini Pokle, Zico Kolter

Released on March 26th, 2024

<aside> 🐌 Diffusion Models: Powerful 🔋 but slow at inference 🐌.

</aside>

<aside> 🚀 Consistency Models: Fast at inference 🚀 but hard to train 🤷‍♀️.

</aside>

<aside> 💫 Tune your pretrained diffusion models to consistency models!

</aside>

In recent years, Diffusion Models (DMs) [1] and Score-based Generative Models (SGMs) [2] have vastly changed the landscape of visual content generation. These powerful models have enabled the creation of stunning, high-quality images that were once thought impossible. However, generating images from pretrained DMs comes with a significant drawback: it requires hundreds to thousands of solver steps due to the curved generation trajectory.

More recently, OpenAI introduced a revolutionary video diffusion model, Sora, which is capable of generating photorealistic, minute-long, high-resolution videos that will leave you in awe. But hold onto your wallets! 💸

Deploying Sora for open access is a huge challenge. Cost estimates suggest that inference could be multiple orders of magnitude more expensive than for Large Language Models (LLMs). “Sora can at most generate about 5 minutes of video per hour per Nvidia H100 GPU,” per [one estimate](https://www.factorialfunds.com/blog/under-the-hood-how-openai-s-sora-model-works). 🫠

This is where efficient generative models come in to save the day (and your budget)! We're here to explore the exciting world of Consistency Models and how they can revolutionize the way we generate high-quality content without breaking the bank.

[Video generated by OpenAI Sora. Prompt: A street-level tour through a futuristic city which in harmony with nature and also simultaneously cyberpunk / high-tech. The city should be clean, with advanced futuristic trams, beautiful fountains, giant holograms everywhere, and robots all over. Have the video be of a human tour guide from the future showing a group of extraterrestrial aliens the coolest and most glorious city that humans are capable of building. Credits to @sama and @DeeperThrill. Compressed by X. 🤣](https://prod-files-secure.s3.us-west-2.amazonaws.com/269642c4-9250-402d-ad7e-997d19150d8d/3b5fc2b6-2df5-4693-8ce5-5636ff1425bd/Untitled.mp4)

Enter Consistency Models (CMs) [3,4], a new family of generative models that are closely related to diffusion models but offer a faster and more efficient alternative. The best results so far were presented at ICLR 2024, where improved Consistency Training (iCT) [4] pushed the quality of images generated by 1-step CMs trained from scratch to a level comparable with state-of-the-art DMs that use thousands of steps for sampling, making them a game-changer in generative models.

Despite the impressive results, the adoption of Consistency Models has been hindered by complex design choices and very high training costs, making it challenging for the community to fully embrace this new model family.

Fig. 1. 1-step sampling quality during training. Consistency Models are pair-preserving, i.e., retaining the correspondence between the noise $\epsilon$ and the image $\mathbf{x}$ and gradually adding more fine-grained features as the training progresses.

This raises two fundamental questions:

1. How can we understand Consistency Models in a principled way?
2. What are the practical guidelines for building Consistency Models?

In this blog post, we aim to address these questions head-on. We will dive deep into the principles behind Consistency Models, breaking down the complexities and presenting a clear, intuitive understanding of how they work. By deconstructing the current approaches and identifying the key components, we will develop a streamlined training recipe termed Easy Consistency Tuning (ECT).

But that's not all! We will also share a remarkable finding that showcases the true potential of ECT. By training a 2-step generative model using ECT, we can surpass state-of-the-art (SoTA) Consistency Models while using only $1/32 \approx 3.1\%$ of the training compute and half the model size of previous approaches. This is a significant leap forward, and it also outperforms all diffusion distillation methods by a wide margin.

Fig. 2. Comparison of learning schemes for efficient generative modeling. Easy Consistency Tuning (ECT) surpasses Consistency Distillation (CD) and Consistency Models trained from scratch (iCT) while requiring only 1/4 (using off-the-shelf models) + 1/32 (actual tuning cost) of the total training compute. ECT also reduces the inference cost to 1/1000 of that of Diffusion Pretraining (Score SDE/DMs) while maintaining comparable generation quality.
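To make the compute fractions quoted above concrete, here is a minimal Python sketch of the arithmetic. It assumes that the 1/4 and 1/32 figures are measured against the same baseline training budget, which the caption suggests but does not spell out. The variable and function names are purely illustrative.

```python
from fractions import Fraction

# Fractions quoted in the text. ASSUMPTION: both are relative to the
# same baseline training budget.
pretrained_share = Fraction(1, 4)   # share attributed to the off-the-shelf diffusion model
tuning_share = Fraction(1, 32)      # share spent on the consistency tuning itself

# Combined budget when the pretrained model's cost is counted as well.
total_share = pretrained_share + tuning_share


def as_percent(x: Fraction) -> float:
    """Convert an exact fraction to a percentage, rounded to three decimals."""
    return round(float(x) * 100, 3)


print(f"tuning only:         {tuning_share} = {as_percent(tuning_share)}%")
print(f"pretrained + tuning: {total_share} = {as_percent(total_share)}%")
```

Under that assumption, the tuning cost alone is 1/32 of the baseline (a little over 3%), and counting the pretrained model as well gives 9/32 (a little over 28%).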

Generative Modeling in a nutshell:

Diffusion Models: Powerful but slow at inference 🐌.