What is Stable Diffusion?
As AI for images and video advances, you are likely to come across the name Stable Diffusion, which is associated with tasks such as text-to-image, text-to-video, image-to-video, and image-to-image generation. Stable Diffusion began as a text-to-image latent diffusion model developed collaboratively by researchers and engineers from CompVis, Stability AI, and LAION. It originated from the research paper “High-Resolution Image Synthesis with Latent Diffusion Models” by Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. The core idea of latent diffusion is to run the diffusion process in a lower-dimensional latent space rather than in pixel space, which reduces the memory and compute cost of high-resolution image synthesis.
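To get a feel for how much smaller the latent space is, here is a rough back-of-the-envelope sketch. It assumes an 8x spatial downsampling factor and 4 latent channels, the values commonly cited for the Stable Diffusion v1 autoencoder; neither number comes from this article, and the helper names are made up for illustration.

```python
def latent_shape(height, width, downsample=8, latent_channels=4):
    """Shape (channels, height, width) of the latent for a given image size.

    downsample=8 and latent_channels=4 are assumed values, not taken
    from this article.
    """
    return (latent_channels, height // downsample, width // downsample)


def num_values(shape):
    """Total number of scalar values in a tensor of the given shape."""
    total = 1
    for dim in shape:
        total *= dim
    return total


pixel_shape = (3, 512, 512)        # RGB image at the training resolution
latent = latent_shape(512, 512)    # what the diffusion process operates on
ratio = num_values(pixel_shape) / num_values(latent)

print(latent, ratio)  # (4, 64, 64) 48.0
```

Under these assumptions, each denoising step works on about 48 times fewer values than it would in pixel space.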
Working and Pre-trained Models
Stable Diffusion is trained on 512×512 images from a subset of the LAION-5B dataset. It uses a frozen OpenAI CLIP ViT-L/14 text encoder to condition the model on text prompts. With an 860M-parameter UNet and a 123M-parameter text encoder, the model is relatively lightweight and can run on consumer GPUs.
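The parameter counts above translate into a rough lower bound on the memory needed just to hold the weights. The sketch below covers only the two components named here (UNet and text encoder) and ignores activations and any other model parts, so treat the result as an estimate rather than a measured requirement. The dictionary and function names are illustrative.

```python
# Parameter counts quoted in the text above.
PARAMS = {"unet": 860_000_000, "text_encoder": 123_000_000}

# Bytes used to store one parameter at each precision.
BYTES_PER_PARAM = {"float32": 4, "float16": 2}


def weight_memory_gb(dtype):
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    total_params = sum(PARAMS.values())
    return total_params * BYTES_PER_PARAM[dtype] / 1e9


for dtype in ("float32", "float16"):
    print(f"{dtype}: {weight_memory_gb(dtype):.2f} GB")
```

Storing the weights in half precision halves this figure, which is one reason half precision is a common choice on consumer GPUs.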
Because training such a model from scratch is out of reach for most people, practitioners typically rely on pre-trained models. That raises the question of where to get the model weights.
This is where Hugging Face comes in. The Hugging Face Hub hosts more than 120K models, 75K datasets, and 150K Spaces (demo apps), all free and open to everyone.
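As a minimal sketch of what pulling pre-trained weights from the Hub can look like with the Diffusers library: the checkpoint name `CompVis/stable-diffusion-v1-4`, the helper names, and the default step count and guidance scale are assumptions chosen for illustration, not necessarily what the rest of this article uses. It also assumes `torch` and `diffusers` are installed.

```python
# Assumed checkpoint; any Stable Diffusion repository on the Hub can be substituted.
MODEL_ID = "CompVis/stable-diffusion-v1-4"


def load_pipeline(model_id=MODEL_ID):
    """Download (or reuse cached) weights from the Hub and return a pipeline."""
    # Imported here so the heavy dependencies load only when actually needed.
    import torch
    from diffusers import StableDiffusionPipeline

    use_gpu = torch.cuda.is_available()
    # Half precision on GPU to save memory; full precision on CPU.
    dtype = torch.float16 if use_gpu else torch.float32
    pipe = StableDiffusionPipeline.from_pretrained(model_id, torch_dtype=dtype)
    return pipe.to("cuda" if use_gpu else "cpu")


def text_to_image(pipe, prompt, steps=30, guidance_scale=7.5):
    """Run one text-to-image generation and return the first image."""
    result = pipe(prompt, num_inference_steps=steps, guidance_scale=guidance_scale)
    return result.images[0]
```

A typical call would be `pipe = load_pipeline()` followed by `text_to_image(pipe, "an astronaut riding a horse").save("out.png")`. The first call downloads several gigabytes of weights, and later calls reuse the local cache.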
Build Text-to-Image with Hugging Face Diffusers
This article implements a text-to-image application using the Hugging Face Diffusers library. We will demonstrate two different pipelines with two different pre-trained Stable Diffusion models. Before diving into the code, let us first understand Stable Diffusion.