CS180 Project 5: Fun With Diffusion Models
Name: Shuai Victor Zhou
SID: 3035912996
Part A: The Power of Diffusion Models
Overview
In this part of the project, we use pretrained diffusion models to generate and edit our own custom images.
Part 0: Setup
For this project, I used seed = 180.
Fewer inference steps gave us faster but less detailed results, whereas more inference steps gave us more detailed images at the cost of longer runtime. With a higher number of inference steps, we see the following effects:
- Our oil painting of the snowy mountain village is much clearer with higher quality, as trees are more detailed and coated with snow. The houses are also much more house-like.
- The man wearing the hat is much more recognizably a person, with clear facial features and distinct facial hair. The hat is also very obvious, and cast shadows can be seen on the man.
- The rocket ship is now clearly depicted as what we’d think of when we think “rocket ship.” The colors are more diverse, the fire behind it has a nice gradient, and we’re also now obviously headed towards space.
Part 1: Sampling Loops
We use pretrained DeepFloyd denoisers to create new high-quality images. Starting from a clean image x0, we can add noise to get noisy images xt; repeating this iteratively, we eventually reach pure noise xT at t = T. In the opposite direction, we can predict the noise at each time t and remove it from the noisy image to eventually recover the original image.
Part 1.1: Implementing the Forward Process
Given a clean image x0, we can calculate the noisy image at time t as
xt = √(ᾱt) · x0 + √(1 − ᾱt) · ε,  where ε ~ N(0, I)
and ᾱt is given by the noise schedule (alphas_cumprod); larger t means more noise.
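A minimal sketch of this forward process, assuming alphas_cumprod is the schedule tensor (the function and variable names here are illustrative, not DeepFloyd's exact API):

```python
import torch

def forward(x0, t, alphas_cumprod):
    """Noise a clean image x0 to timestep t (illustrative sketch).

    x0:             clean image tensor, e.g. shape (3, 64, 64), values in [0, 1]
    t:              integer timestep
    alphas_cumprod: 1-D tensor of cumulative alpha products (the noise schedule)
    """
    alpha_bar = alphas_cumprod[t]
    eps = torch.randn_like(x0)                                   # ε ~ N(0, I)
    x_t = alpha_bar.sqrt() * x0 + (1 - alpha_bar).sqrt() * eps   # scale image, add scaled noise
    return x_t, eps
```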
Part 1.2: Classical Denoising
With the generated noisy images, we can use Gaussian blur filtering to attempt to remove the noise.
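As a reference point, a minimal sketch of this classical baseline using torchvision's Gaussian blur (the kernel size and sigma here are just example values):

```python
import torchvision.transforms.functional as TF

def gaussian_denoise(x_noisy, kernel_size=5, sigma=2.0):
    """Naive denoising: a Gaussian blur removes high-frequency noise, but also detail."""
    return TF.gaussian_blur(x_noisy, kernel_size=kernel_size, sigma=sigma)
```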
Part 1.3: One-Step Denoising
Once again using the three noisy images from before, we use the UNet to estimate the noise from the images. We can then remove the noise from the noisy image to estimate the original image.
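A sketch of the one-step estimate, assuming unet is DeepFloyd's stage-1 UNet called through the usual diffusers interface (the stage-1 model also predicts variance channels, which are dropped here; exact call details may differ):

```python
import torch

def one_step_denoise(x_t, t, unet, alphas_cumprod, prompt_embeds):
    """Estimate the clean image x0 from the noisy image x_t in a single step."""
    alpha_bar = alphas_cumprod[t]
    with torch.no_grad():
        # Assumed call pattern: the first 3 output channels are the noise estimate.
        out = unet(x_t.unsqueeze(0), t, encoder_hidden_states=prompt_embeds).sample
        eps = out[0, :3]
    # Invert x_t = sqrt(a_bar) * x0 + sqrt(1 - a_bar) * eps.
    x0_est = (x_t - (1 - alpha_bar).sqrt() * eps) / alpha_bar.sqrt()
    return x0_est
```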
Part 1.4: Iterative Denoising
Adding onto the previous part, we can iteratively denoise by running what we did in part 1.3 over multiple steps. To get the estimate at a less-noisy timestep t' (with t' < t) from the image at timestep t, we use the formula
xt' = (√(ᾱt') · βt / (1 − ᾱt)) · x0 + (√(αt) · (1 − ᾱt') / (1 − ᾱt)) · xt + vσ
where αt = ᾱt / ᾱt', βt = 1 − αt, x0 is the current estimate of the clean image (from the one-step formula above), and vσ is random noise that gets added back in.
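A sketch of one step of this update, assuming eps is the UNet's noise estimate at timestep t and alphas_cumprod is the noise schedule (names are illustrative):

```python
def denoise_step(x_t, t, t_prime, eps, alphas_cumprod):
    """One iterative-denoising update from timestep t to the less-noisy timestep t' < t."""
    a_bar_t  = alphas_cumprod[t]
    a_bar_tp = alphas_cumprod[t_prime]
    alpha = a_bar_t / a_bar_tp
    beta  = 1 - alpha

    # Current clean-image estimate, as in the one-step denoiser.
    x0_est = (x_t - (1 - a_bar_t).sqrt() * eps) / a_bar_t.sqrt()

    # Blend the clean-image estimate with the current noisy image.
    x_tp = (a_bar_tp.sqrt() * beta / (1 - a_bar_t)) * x0_est \
         + (alpha.sqrt() * (1 - a_bar_tp) / (1 - a_bar_t)) * x_t
    return x_tp   # the full loop also adds back the predicted-variance noise v_sigma
```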
Part 1.5: Diffusion Model Sampling
Doing the same thing but now starting from random noise instead of a noisy Campanile, we can get the following 5 sampled images using the prompt “a high quality photo.”
Part 1.6: Classifier Free Guidance
The images from part 1.5 weren’t of the highest quality; to improve this, we use Classifier-Free Guidance (CFG). We compute two noise estimates, one conditioned on the text prompt (εc) and one unconditional (εu). Our combined noise estimate is then
ε = εu + γ(εc − εu)
where γ controls the strength of the CFG. Using γ = 7, we generate the following 5 sampled images, again using the prompt “a high quality photo.”
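The CFG combination itself is a one-liner; a minimal sketch, assuming the two noise estimates have already been computed:

```python
def cfg_noise_estimate(eps_uncond, eps_cond, gamma=7.0):
    """Classifier-free guidance: move past the conditional estimate by a factor gamma."""
    return eps_uncond + gamma * (eps_cond - eps_uncond)
```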
Part 1.7: Image-to-Image Translation
Once again returning to our image of the Campanile, we can again call our iterative_denoise_cfg function, except this time using starting indices of [1, 3, 5, 7, 10, 20]. This produces a series of images that look progressively closer to the original image of the Campanile.
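Roughly, this amounts to noising the Campanile to each starting level and handing it back to the denoising loop. A sketch, assuming the forward helper from part 1.1 and that iterative_denoise_cfg takes a starting index (strided_timesteps, which maps indices to timesteps, is illustrative):

```python
start_indices = [1, 3, 5, 7, 10, 20]
edits = []
for i_start in start_indices:
    t = strided_timesteps[i_start]                     # timestep for this noise level
    x_t, _ = forward(campanile, t, alphas_cumprod)     # noise the original image
    edits.append(iterative_denoise_cfg(x_t, i_start))  # denoise back toward the image manifold
```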
Part 1.7.1: Editing Hand-Drawn and Web Images
We can also perform the image-to-image translation with not just the Campanile, but also our own images.
Part 1.7.2: Inpainting
Using a binary mask m, we can generate images where part of the image stays the same as the original and everything else is regenerated. This is accomplished by running the diffusion denoising loop, but at each step performing
xt ← m · xt + (1 − m) · forward(xorig, t)
where forward(xorig, t) is the original image noised to timestep t (as in part 1.1). The region where m = 1 is filled with new content, while everything else is forced to match the original.
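A sketch of that per-step projection, assuming the forward helper from part 1.1 and a mask m that is 1 in the region to regenerate:

```python
def inpaint_step(x_t, t, x_orig, m, alphas_cumprod):
    """After each denoising step, force the unmasked pixels back to the
    (appropriately noised) original image; only the masked region is regenerated."""
    x_orig_t, _ = forward(x_orig, t, alphas_cumprod)  # original image noised to timestep t
    return m * x_t + (1 - m) * x_orig_t
```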
Part 1.7.3: Text-Conditional Image-to-Image Translation
Now, instead of using our prompt of “a high quality photo,” we can use our own custom prompts to generate image-to-image translations.
Using prompt “a rocket ship”:
Using prompt “a lithograph of a skull”:
Using prompt “a photo of a hipster barista”:
Part 1.8: Visual Anagrams
Using similar methods as before, we can generate images that look like one thing when upright and another when upside down. We do this with two different prompts: denoising the image xt with the first prompt gives noise estimate ε1, and denoising an upside-down version of xt with the second prompt gives noise estimate ε2. We then flip ε2 back upright and take a weighted average of the two noise estimates as our final noise estimate.
Usually, we’d have weight w = 0.5, but for our first visual anagram of the people around a campfire and an old man, my best result came with w = 0.7.
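A sketch of the anagram noise estimate, assuming noise_estimate(x, prompt) wraps the CFG noise estimate from part 1.6 (the helper name is illustrative):

```python
import torch

def anagram_noise_estimate(x_t, prompt_upright, prompt_flipped, noise_estimate, w=0.5):
    """Weighted average of an upright and a flipped-and-unflipped noise estimate."""
    eps1 = noise_estimate(x_t, prompt_upright)                         # upright view
    eps2 = noise_estimate(torch.flip(x_t, dims=[-2]), prompt_flipped)  # upside-down view
    eps2 = torch.flip(eps2, dims=[-2])                                 # flip estimate back upright
    return w * eps1 + (1 - w) * eps2
```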
Part 1.9: Hybrid Images
To create images that seem like one thing up close and another from far away, we utilize low-pass and high-pass filters. Again, two noise estimates ε1 and ε2 are created using separate prompts, but this time they’re combined as
ε = f_lowpass(ε1) + f_highpass(ε2)
where ε1 comes from the “far away” prompt and ε2 from the “close up” prompt. The low-pass filter is a Gaussian blur with kernel size 33 and sigma 2, and the high-pass is its complement (the noise estimate minus its blurred version).
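A sketch of the hybrid combination, again assuming the noise_estimate helper from part 1.6 and using torchvision's Gaussian blur as the low-pass filter:

```python
import torchvision.transforms.functional as TF

def hybrid_noise_estimate(x_t, prompt_far, prompt_close, noise_estimate,
                          kernel_size=33, sigma=2.0):
    """Low-pass one prompt's noise estimate, high-pass the other's, then sum."""
    eps1 = noise_estimate(x_t, prompt_far)    # dominates from far away (low frequencies)
    eps2 = noise_estimate(x_t, prompt_close)  # dominates up close (high frequencies)
    low  = TF.gaussian_blur(eps1, kernel_size=kernel_size, sigma=sigma)
    high = eps2 - TF.gaussian_blur(eps2, kernel_size=kernel_size, sigma=sigma)
    return low + high
```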
Far: “a lithograph of a skull”
Close: “a lithograph of waterfalls”
Far: "a photo of a dog”
Close: "an oil painting of people around a campfire”
Far: "a rocket ship”
Close: "a man wearing a hat”
Part B: Diffusion Models from Scratch
Overview
In this second part of the project, we train our own diffusion models using MNIST.
Part 1: Training a Single-Step Denoising UNet
We first build a one-step denoiser.
Part 1.1: Implementing the UNet
We implement the following UNet architecture, built from the standard UNet operations (downsampling, upsampling, and skip connections):
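As one example of the building blocks involved, a hedged sketch of a simple ConvBlock-style operation (the exact layer choices and names in my UNet may differ):

```python
import torch.nn as nn

class ConvBlock(nn.Module):
    """Two 3x3 convolutions that preserve spatial size, each followed by BatchNorm and GELU."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.GELU(),
            nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.GELU(),
        )

    def forward(self, x):
        return self.block(x)
```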
Part 1.2: Using the UNet to Train a Denoiser
In order to train our denoiser, we need to generate noisy images with
z = x + σ · ε,  where ε ~ N(0, I).
For σ = [0.0, 0.2, 0.4, 0.5, 0.6, 0.8, 1.0], we get the following images in order.
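These training pairs can be generated on the fly so that fresh noise is sampled every time; a minimal sketch (the helper name add_noise is illustrative):

```python
import torch

def add_noise(x, sigma):
    """Produce a noisy batch z = x + sigma * eps with eps ~ N(0, I)."""
    return x + sigma * torch.randn_like(x)
```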
Part 1.2.1: Training
We now train a denoiser to denoise images created by applying σ = 0.5 noise to the clean images. We use a batch size of 256 and train over our dataset for 5 epochs, with a hidden dimension of 128 for our UNet and the Adam optimizer with a learning rate of 1e-4.
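A condensed sketch of that training setup (the UNet constructor shown here is illustrative, and the add_noise helper is the one sketched in part 1.2; hyperparameters match those listed above):

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

sigma = 0.5
device = "cuda" if torch.cuda.is_available() else "cpu"

train_set = datasets.MNIST(root="./data", train=True, download=True,
                           transform=transforms.ToTensor())
loader = DataLoader(train_set, batch_size=256, shuffle=True)

model = UNet(in_channels=1, num_hiddens=128).to(device)  # assumed constructor signature
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
criterion = nn.MSELoss()

losses = []
for epoch in range(5):
    for x, _ in loader:                        # labels are unused for denoising
        x = x.to(device)
        z = add_noise(x, sigma)                # noisy input (see part 1.2)
        loss = criterion(model(z), x)          # L2 loss against the clean image
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        losses.append(loss.item())             # recorded for the loss curve below
```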
Below is our loss curve.
Below are the results on the test set after the first and fifth epochs.
Part 1.2.2: Out-of-Distribution Testing
When we test on out-of-distribution noise levels, we get the following results for σ = [0.0, 0.2, 0.4, 0.5, 0.6, 0.8, 1.0], alternating in order between noisy and denoised images.